Anomaly Detection: When to Reach for Statistics, When to Reach for AI
The standard SQL Server alerting setup is a threshold over a metric. CPU above 80% for 5 minutes. Replica lag above 10 seconds. Database file growth above 10%. These alerts are useful for the failure modes you anticipated. They are useless for the failure modes you didn't anticipate, which are the ones that actually hurt.
The case I keep coming back to: a customer database where query latency drifted from 80ms p99 to 220ms p99 over six weeks. No single threshold ever fired. No single deployment correlated with the drift. By week six, the application team had escalated and we spent a Tuesday afternoon untangling a slow accumulation of plan regressions across three procedures, each individually below the alert threshold, collectively causing a 2.75x latency increase.
This is the anomaly problem. Threshold alerts are a sieve with very large holes.
Statistical anomaly detection — the unsexy step that pays
Before reaching for LLMs, reach for statistics. Anomaly detection at the metric level has a 30-year body of work behind it, and it is computationally cheap.
The simplest useful method: compute a rolling 30-day mean and standard deviation for each metric you care about. Anything more than 3 sigma from that baseline is an anomaly. Anything that has stayed beyond 2 sigma for more than 60 minutes is a drift signal.
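A minimal sketch of that detector in pandas, assuming one sample per minute on a datetime index (the window sizes and column names just mirror the numbers above and are illustrative, not prescriptive):

```python
import pandas as pd

def flag_anomalies(metric: pd.Series) -> pd.DataFrame:
    # 30-day rolling baseline and spread, computed from the metric's own history
    baseline = metric.rolling("30D").mean()
    sigma = metric.rolling("30D").std()
    z = (metric - baseline) / sigma

    out = pd.DataFrame({"value": metric, "z": z})
    out["anomaly"] = z.abs() > 3  # point anomaly: more than 3 sigma from baseline
    # drift: every sample in the last 60 minutes has been beyond 2 sigma
    sustained = (z.abs() > 2).astype(float).rolling("60min").min()
    out["drift"] = sustained == 1.0
    return out
```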
For seasonal metrics — queries-per-second has weekday/weekend cycles — use STL decomposition or an ARIMA model. The Python statsmodels library does both, and the runtime cost on a 30-day metric series is measured in tens of milliseconds.
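A sketch of the seasonal version using statsmodels' STL. The period of 168 assumes hourly samples and a weekly cycle; whatever the decomposition cannot explain as trend or seasonality lands in the residual, and the residual is what you threshold:

```python
from statsmodels.tsa.seasonal import STL

def seasonal_anomalies(qps, period=168, threshold=3.0):
    # decompose into trend + weekly seasonality + residual
    result = STL(qps, period=period, robust=True).fit()
    resid = result.resid
    z = (resid - resid.mean()) / resid.std()
    return qps[z.abs() > threshold]  # samples the weekly pattern cannot explain
```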
What you get from statistical anomaly detection is a list of "this metric is behaving differently than its own history." That alone catches drift no threshold catches.
What it doesn't do is explain the anomaly. An ARIMA model doesn't know that the latency drift is correlated with three plan regressions. It just knows latency is up.
LLM-based anomaly detection — useful for the explanation, not the detection
The instinct when you hear "AI for monitoring" is to put the AI on the front line of detection. That is wrong. LLMs are bad at numerical anomaly detection. They cannot reason about a 30-day time series in a token-efficient way, and the cost of feeding them all the data is prohibitive.
The right place for LLMs is the explanation layer, after the statistical detector has already flagged something interesting.
The flow that works:
- Statistical detector flags an anomaly. "p99 latency on db-prod-04 has been above the 2-sigma threshold for 73 minutes."
- The orchestration layer pulls the relevant context: top queries by elapsed time, recent plan changes from sys.dm_exec_query_stats, recent index changes, recent deployments, wait stats deltas.
- The LLM gets the structured context and the anomaly description. Prompt: "Given this anomaly and this context, what is the most likely cause? List the top three hypotheses with the diagnostic step that would distinguish between them."
The output is a triage starting point with explicit next queries to run. This is qualitatively different from "p99 latency is up" — it is "p99 latency is up, and the most likely cause is plan regression on sp_GetOrderHistory because its avg_logical_reads tripled while its execution_count is unchanged. Pull sp_GetOrderHistory's plan from cache and diff it against the version from last week."
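A minimal sketch of that explanation layer. The context-gathering is left to the caller, and the OpenAI client and model name are my assumptions rather than a prescribed stack; the prompt wording is the one above:

```python
from openai import OpenAI

client = OpenAI()

def explain_anomaly(anomaly: str, context: dict) -> str:
    # `context` is whatever the orchestration layer pulled: top queries,
    # plan changes, deployments, wait stats deltas, as key/value text.
    prompt = (
        f"Anomaly: {anomaly}\n\n"
        "Context:\n"
        + "\n".join(f"- {k}: {v}" for k, v in context.items())
        + "\n\nGiven this anomaly and this context, what is the most likely cause? "
        "List the top three hypotheses with the diagnostic step that would "
        "distinguish between them."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```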
Plan-regression detection is the highest-value example
Plan regression — when a stored procedure's execution plan changes for the worse — is the canonical hard-to-catch problem. Threshold alerts don't fire because the procedure has not fully fallen over yet. The team that owns the procedure is not watching it because it still runs. By the time the latency hits the user, the regression has been there for days.
Detection: hash each query plan from sys.dm_exec_query_plan weekly. When a procedure's plan hash changes, log the old plan and the new plan side-by-side.
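One way to implement that snapshot, sketched below. SQL Server already computes query_plan_hash in sys.dm_exec_query_stats, so this version records that value alongside the plan XML rather than hashing the XML by hand; the pyodbc connection handling and the dictionary shape are my assumptions:

```python
import pyodbc

SNAPSHOT_SQL = """
SELECT
    OBJECT_NAME(qp.objectid, qp.dbid) AS procedure_name,
    qs.query_plan_hash,
    qp.query_plan                     AS plan_xml
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE qp.objectid IS NOT NULL;
"""

def snapshot_plans(conn_str: str) -> dict:
    """Return {procedure_name: (plan_hash, plan_xml)} for cached plans.

    A procedure can have several cached statements; this keeps the last
    row per name, which is fine for a sketch.
    """
    conn = pyodbc.connect(conn_str)
    try:
        rows = conn.cursor().execute(SNAPSHOT_SQL).fetchall()
    finally:
        conn.close()
    return {r.procedure_name: (r.query_plan_hash, r.plan_xml) for r in rows}

def changed_plans(last_week: dict, this_week: dict) -> list:
    """Procedures whose plan hash differs from last week's snapshot."""
    return [
        name for name, (plan_hash, _) in this_week.items()
        if name in last_week and last_week[name][0] != plan_hash
    ]
```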
LLM analysis: feed both plans to the model. "The plan hash changed for sp_GetOrderHistory between 2026-04-15 and 2026-04-22. The old plan has a clustered index seek; the new plan has a clustered index scan. Why did the optimizer change its mind?"
The model is good at reading plan XML and identifying which operator is different and what it suggests. Common findings: parameter sniffing (a different leading parameter caused a different plan), statistics drift (statistics were updated and the cardinality estimate changed), index usage shift (an index was dropped or modified).
This is detection that no threshold catches and no statistical model explains. It is also the kind of regression that, three days from now, becomes a 2 AM page.
When statistics is the right tool
For metrics with clear seasonality and clear baselines — CPU, memory, queries per second, replication lag — statistical methods are the right tool. They are cheap, well-understood, and don't hallucinate.
LLMs add no value to detection on these metrics. They add a lot of value to the post-detection explanation, but the detection itself should stay in the statistical layer.
When the LLM is the right tool
For pattern detection over structured-but-irregular data — query plans, slow query lists, deadlock graphs, error log entries — the LLM is the right tool. These are not time series; they are documents. Embeddings and LLM reasoning are designed for documents.
A specific high-value example: deadlock graph analysis. SQL Server logs deadlock graphs as XML. They are unreadable to most humans without significant practice. Feed the deadlock graph to a model and ask: "Which two queries are involved, what is each holding, what is each waiting for, and what is the smallest change that would prevent this deadlock?" The model is good at this. The output is a triage paragraph in plain English instead of an XML blob nobody on the team has read in two years.
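A sketch of that triage step, reusing the same assumed client and model as the earlier explanation sketch. The deadlock XML would typically come from the system_health extended events session:

```python
from openai import OpenAI

client = OpenAI()

def triage_deadlock(deadlock_xml: str) -> str:
    # Pass the raw deadlock graph straight through; it is a document,
    # not a time series, which is exactly why the LLM is the right tool here.
    prompt = (
        "Here is a SQL Server deadlock graph:\n\n"
        f"{deadlock_xml}\n\n"
        "Which two queries are involved, what is each holding, what is each "
        "waiting for, and what is the smallest change that would prevent "
        "this deadlock?"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```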
The pattern that holds up
The default architecture for AI-assisted monitoring is not "AI does everything." It is:
- Statistical detection for time-series metrics. Rolling baselines, sigma thresholds, ARIMA for seasonality.
- Pattern detection via LLM for structured documents. Plan XML, deadlock graphs, slow-query texts.
- LLM explanation for any anomaly the statistical layer flags, gated on having enough context to reason.
- Human approval for any action that follows.
You will catch drift that thresholds miss, you will get explanations the statistics cannot provide, and you will not pay LLM tokens for every metric tick. The cost stays bounded.
The full architecture is in The Birth of Bob, specifically the sensor / reasoning / actuator chapters. The short version: detection is cheap, explanation is the bottleneck, and the LLM goes in the bottleneck.