I once buried a deliberate joke about cat photos on page 87 of a 214-page weekly SQL Server health-check report. Fourteen people received it every Monday morning. Nobody mentioned the joke. That is how I knew the report had become a tombstone.

The DBA's actual job is variance detection

The job of a database administrator is not to know everything about a database. The job is to know whether a database is behaving differently today than it was last Tuesday — and if it is, to know whether the difference matters.

That is one question, and it has three honest answers: yes-and-fix, yes-but-benign, and no-and-keep-watching. Most monitoring infrastructure is built to answer everything except that one question. We collect wait stats, query plans, fragmentation percentages, and deadlock graphs. The collection layer is solved. The interpretation layer is where DBAs actually spend their day.

So why do health-check reports keep growing?

Because every time something gets missed, a section gets added. A fragmentation spike that took the warehouse down? Add a fragmentation section. Tempdb growth nobody predicted? Add a tempdb section. A query regression that masqueraded as an application bug for three days? Add a top-query-by-CPU section. Two years of "add a section" later, you have a 214-page document and a distribution list that has trained itself to mark it read.

The structure of the document tracks the anxiety of the person who built it. That is honest. It is also a lousy way to design a monitoring product.

214 pages of data is not information

Volume kills interpretation. Humans habituate. The eye scans for "something wrong," and when nothing jumps out in the first thirty seconds, we close the tab. The report has done the work of data collection but offloaded the cognitive work back onto the reader. That cognition was supposed to be the value-add.

Timing makes it worse. A report that runs at 6 AM Monday is a snapshot of one instant, and it is thirty hours stale by the time anyone reads it on Tuesday morning. Anything that starts quietly degrading after that snapshot never appears in it at all. The application team catches it on Tuesday afternoon, when their batch job times out.

SQL Server gives you the answer in real time, if you ask

The Dynamic Management Views are an underused gift. A handful of them carry most of the diagnostic weight:

  • sys.dm_os_wait_stats — what the engine is waiting on, since startup or since the last reset.
  • sys.dm_exec_query_stats joined with sys.dm_exec_sql_text and sys.dm_exec_query_plan — the top regressing queries with their plans.
  • sys.dm_db_index_usage_stats — which indexes are actually read versus merely maintained as dead weight on every write.
  • sys.dm_os_process_memory — memory pressure as the engine sees it, not as the OS reports it.
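To show how thin the collection layer really is, here is a sketch of those four signals as queries. The T-SQL is condensed for illustration (column lists trimmed, no filtering policy), and the cursor is assumed to be a pyodbc-style DB-API cursor whose execute() returns the cursor itself.

```python
# Hypothetical collection layer: one condensed query per diagnostic signal.
DMV_QUERIES = {
    "waits": """
        SELECT wait_type, wait_time_ms, waiting_tasks_count
        FROM sys.dm_os_wait_stats
        WHERE wait_time_ms > 0;""",
    "top_queries": """
        SELECT TOP (10) qs.total_worker_time / qs.execution_count AS avg_cpu,
               st.text, qp.query_plan
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
        ORDER BY avg_cpu DESC;""",
    "index_usage": """
        SELECT OBJECT_NAME(object_id) AS table_name, index_id,
               user_seeks, user_scans, user_lookups, user_updates
        FROM sys.dm_db_index_usage_stats
        WHERE database_id = DB_ID();""",
    "memory": """
        SELECT physical_memory_in_use_kb, memory_utilization_percentage,
               process_physical_memory_low
        FROM sys.dm_os_process_memory;""",
}

def collect(cursor):
    """Run every query on an open cursor; return raw rows keyed by signal name."""
    return {name: cursor.execute(sql).fetchall()
            for name, sql in DMV_QUERIES.items()}
```

That is the whole collection problem. Everything hard happens after these rows come back.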

You can ask these questions every minute. The cost is negligible. The problem with continuous monitoring has never been the collection. It is the interpretation.
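One wrinkle worth making concrete: the wait-stat counters are cumulative since startup (or since the last reset), so per-minute sampling only becomes meaningful once you difference consecutive snapshots. A minimal sketch of that delta step, with illustrative wait-type names:

```python
def wait_deltas(prev, curr):
    """Per-interval wait-time growth, given two cumulative snapshots
    (dicts of wait_type -> wait_time_ms). A counter that shrank means
    the stats were reset, so treat the current value as the delta."""
    deltas = {}
    for wait_type, ms in curr.items():
        before = prev.get(wait_type, 0)
        deltas[wait_type] = ms - before if ms >= before else ms
    # Keep only wait types that actually accumulated time this interval.
    return {w: d for w, d in deltas.items() if d > 0}

prev = {"CXPACKET": 1_000, "PAGEIOLATCH_SH": 400}
curr = {"CXPACKET": 9_000, "PAGEIOLATCH_SH": 410}
wait_deltas(prev, curr)  # CXPACKET grew 8,000 ms this interval; that is the signal
```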

You can wire SQL Server up to Prometheus, Grafana, OpenTelemetry, or a homegrown agent and have a torrent of metrics within an afternoon. What you cannot easily build is the layer that looks at fifteen wait types simultaneously, three regressing queries, and a memory pressure shift, and says: "this is a parallelism problem on a workload that grew past your MAXDOP threshold."

That synthesis is what a senior DBA does when you walk into their office. It is also what they leave with when they take another job.

This is a reasoning problem, not a tooling problem

Rule engines and threshold alerts can detect the failures you already know about. They cannot generalize. Every new failure mode requires a new rule. After enough rules, the rule set becomes its own maintenance burden — you are debugging the alert engine instead of the database.

Language models, applied to structured diagnostic data, can do the synthesis. Not perfectly. They hallucinate. They miss things a senior DBA would catch. They occasionally invent a wait type that does not exist. But they generalize across signal patterns they have not seen in exactly that combination before, and they explain their reasoning in plain English that a developer or a manager can act on without translation.

Used correctly — as a reasoning layer over real-time DMV state, with verification gates and a tight blast radius — they replace the 214-page weekly report with something better: a continuous variance signal that fires when there is something worth paying attention to and stays quiet when there is not.
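One shape such a verification gate can take — a sketch, not the book's actual implementation, with a hypothetical finding format — is to check every wait type the model cites against the set you actually collected, and hold back any finding that leans on evidence you never observed:

```python
def gate_finding(finding, observed_waits):
    """Reject a model-produced finding that cites evidence we never collected.

    `finding` is assumed to look like:
      {"summary": "...", "evidence_waits": ["CXPACKET", ...]}
    Returns (passed, invented_wait_types)."""
    invented = [w for w in finding.get("evidence_waits", [])
                if w not in observed_waits]
    return (len(invented) == 0, invented)

observed = {"CXPACKET", "SOS_SCHEDULER_YIELD", "PAGEIOLATCH_SH"}
ok, bad = gate_finding(
    {"summary": "parallelism pressure", "evidence_waits": ["CXPACKET", "CXBUFFER"]},
    observed,
)
# ok is False: "CXBUFFER" was never observed, so the finding is quarantined
```

The gate is deliberately dumb. It does not judge the reasoning; it only refuses to surface a conclusion built on evidence that is not in the snapshot.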

What to delete from your monitoring stack this week

If your weekly SQL Server health check is over thirty pages, the format has outgrown its purpose. Three suggestions:

  1. Cut the historical context section. The data is in the DMVs. Anyone who needs the history can pull it themselves.
  2. Replace standing sections with exception-driven alerts. A "Memory Pressure" section that prints "OK" 51 weeks out of 52 is worse than nothing — it has trained the reader to skip it.
  3. Move from weekly batch to continuous. The right interval for variance detection is closer to one minute than one week.
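Suggestion 2 in miniature, with illustrative baseline and band values: an exception-driven check returns nothing when the signal sits inside its normal band, so there is no standing "OK" section to train the reader to skip.

```python
def check_memory(utilization_pct, baseline_pct=65.0, band_pct=15.0):
    """Return an alert string only when utilization leaves the baseline band.
    None means stay quiet -- silence is the report for a healthy signal."""
    if abs(utilization_pct - baseline_pct) <= band_pct:
        return None
    return (f"memory utilization {utilization_pct:.0f}% "
            f"vs baseline {baseline_pct:.0f}%")

check_memory(70.0)  # None: inside the band, nothing is emitted
check_memory(94.0)  # an alert string that is actually worth reading
```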

The full version of this argument is in Chapter 1 of The Birth of Bob, the book about a local AI agent for SQL Server self-healing that I built because the report problem refused to go away.