The page came in at 2:14 AM. The alert said the warehouse OLTP cluster was running 90th-percentile query latency at 4.2 seconds against a usual baseline of 80ms. By the time I had a laptop open, the latency was at 6 seconds and the application team had started escalating to my manager. This is the situation where AI assistance delivers its full value.

The on-call DBA's actual problem

The myth of on-call DBAs is that they know what's wrong instantly because they've seen it before. The reality is that experienced DBAs have seen most of the failure modes individually, but the combination in front of them right now is usually one they haven't seen in this particular configuration. They know the components. They have to compose the diagnosis on the spot.

Composition under sleep deprivation is hard. Working memory shrinks. You miss things you would catch at 2 PM. You overweight the most recent symptom and miss a contributing factor that has been there for hours.

This is the gap an LLM-based assistant is genuinely good at filling. Not at making decisions for you. At reducing the cognitive load of holding ten signals in your head simultaneously.

The runbook reader pattern

The simplest pattern is the most useful: feed the alert into the model and ask it to triage.

A real example, one I have used verbatim at 2 AM:

"Latency on db-prod-04 spiked to 6 seconds at the 90th percentile, started 11 minutes ago. Wait stats top three are PAGEIOLATCH_SH, ASYNC_NETWORK_IO, and CXPACKET. No deployments in the last 24 hours. Top query by total elapsed time is reading from dbo.SalesLedgerHist with no filter on PostingDate. What do I look at first?"

A well-prompted model returns something like:

  1. PAGEIOLATCH_SH at the top with no recent deployment suggests buffer pool eviction; check whether the working set has shifted — query sys.dm_os_buffer_descriptors grouped by object to see what is occupying the buffer pool right now.
  2. The unfiltered scan on SalesLedgerHist is suspicious. If the stored procedure recompiled with an atypical parameter and the new plan scans instead of seeking on PostingDate, you would see exactly this pattern. Pull the plan from the cache.
  3. ASYNC_NETWORK_IO at this percentile usually means a downstream consumer can't drain the result fast enough; check whether the application connection pool is saturated.

That answer would not replace your judgment. It restructures the problem into three threads, each with a concrete next query. At 2 AM, that is the difference between productive triage and staring at a Grafana dashboard hoping a pattern jumps out.
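
That prompt is worth saving as a template rather than retyping under pressure. A minimal sketch in Python — the field names and template wording are my own, not part of any tool:

```python
# Saved triage-prompt template. Field names are illustrative; keep whatever
# your alerts actually give you.
TRIAGE_TEMPLATE = (
    "Latency on {host} spiked to {latency} at the {percentile} percentile, "
    "started {minutes_ago} minutes ago. Wait stats top three are {waits}. "
    "Deployments in the last 24 hours: {deployments}. "
    "Top query by total elapsed time: {top_query}. "
    "What do I look at first?"
)

def build_triage_prompt(host, latency, percentile, minutes_ago,
                        waits, deployments, top_query):
    # Fill the saved template; paste the result into your chat tool of choice.
    return TRIAGE_TEMPLATE.format(
        host=host, latency=latency, percentile=percentile,
        minutes_ago=minutes_ago, waits=", ".join(waits),
        deployments=deployments, top_query=top_query,
    )

prompt = build_triage_prompt(
    host="db-prod-04", latency="6 seconds", percentile="90th",
    minutes_ago=11,
    waits=["PAGEIOLATCH_SH", "ASYNC_NETWORK_IO", "CXPACKET"],
    deployments="none",
    top_query="unfiltered read from dbo.SalesLedgerHist",
)
print(prompt)
```

The point of the template is not the string formatting; it is that at 2 AM you fill in blanks instead of composing a coherent problem statement from scratch.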

Postmortem assistance

The other place AI earns its place is the morning after. Postmortems are work that gets deferred because writing them is unrewarding when you would rather sleep.

Feed the timeline into the model and ask it to draft the contributing-factors section. The output will need editing. The output will also be longer and more complete than the version you would have written half-asleep. Same for the action items: "given this incident, what are five action items that would make this not happen again," then prune the model's list to the three that actually fit your context.

The model will not understand which actions you can realistically take given your team's capacity. It will, however, brainstorm five options when you would have brainstormed two.
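
The skeleton can live next to the triage prompt. A sketch of one, with illustrative section names — the parenthetical notes mark where the model drafts and the human edits:

```python
# Skeleton postmortem template. Section names are illustrative; adapt them
# to whatever your org's postmortems already look like.
POSTMORTEM_TEMPLATE = """\
Incident {incident_id}: {title}

Timeline
{timeline}

Contributing factors
(model-drafted from the incident transcript; human-edited before publishing)
{contributing_factors}

Customer impact
{customer_impact}

Action items
(model brainstorms five; keep the three that fit your team's capacity)
{action_items}
"""
```

Drop the chat transcript into the model alongside this skeleton and ask it to populate the placeholders; everything that leaves the draft stage goes through a human edit first.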

What absolutely not to do

The on-call use case is also where the temptation to give the AI direct mutating access is highest. "Just let it run the diagnostic queries for me." "Just let it set the affinity mask." "Just let it kill the blocking session."

This is the wrong end of the boundary. The 2 AM situation is exactly when you want a tighter, not looser, separation between reasoning and action. Tired humans approve things they should not. An AI that can mutate state during your worst cognitive hour is a worse safety story than one that can only read.

The pattern that holds up: the AI reads, summarizes, suggests. The DBA executes. The audit trail is unambiguous — every mutation in the system was the human's choice, made on the human's authority, even when the human was triaging at 2 AM with three browser tabs and a mug of stale coffee.
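
If you do wire the assistant up to run diagnostic queries at all, that boundary should be enforced in code, not just in policy. A crude sketch of a read-only statement gate — a hypothetical helper, and no substitute for also connecting with a read-only login:

```python
import re

# Crude allowlist gate: permit only statements that start with a read-only
# verb and contain no mutating keyword. Belt-and-suspenders on top of a
# read-only database login, never instead of one.
READ_ONLY_VERBS = ("select", "with")
MUTATING = re.compile(
    r"\b(insert|update|delete|merge|truncate|drop|alter|create"
    r"|grant|revoke|exec|execute|kill)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    stripped = sql.strip().lower()
    return stripped.startswith(READ_ONLY_VERBS) and not MUTATING.search(sql)
```

Note that the word-boundary regex still lets `sys.dm_exec_requests` through (underscores count as word characters), while `EXEC sp_who` or `KILL 57` is rejected. The gate rejects anything ambiguous; during an incident, a false rejection costs you a manual paste, a false acceptance costs you the audit trail.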

What to set up before your next on-call rotation

Three concrete pieces of preparation, none of which require buying anything:

  1. A pre-formed triage prompt. Save a template that takes the alert, recent deployments, top wait stats, and top queries, and asks for a triage list. Paste it into your chat tool of choice. Do this once, save it, reuse it on every page.
  2. A read-only DMV query bundle. Have a saved set of the diagnostic queries you reach for during incidents. The AI is more useful when you can paste real DMV output back to it, not when you are typing pseudo-data.
  3. A postmortem template. A skeleton with sections for timeline, contributing factors, customer impact, action items. Drop the chat transcript in, ask the model to populate it, edit before publishing.
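
The second item can be as simple as a dictionary of saved queries. A sketch — the queries are standard SQL Server DMV diagnostics, but the bundle structure around them is mine:

```python
# Saved bundle of read-only diagnostic queries (SQL Server DMVs). Stored as
# plain strings so they can be pasted into a query window, and the output
# pasted back into the triage chat.
DMV_BUNDLE = {
    "top_waits": """
        SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
        FROM sys.dm_os_wait_stats
        ORDER BY wait_time_ms DESC;
    """,
    "active_requests": """
        SELECT r.session_id, r.status, r.wait_type,
               r.blocking_session_id, t.text
        FROM sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t;
    """,
    "top_queries_by_elapsed": """
        SELECT TOP 10 qs.total_elapsed_time, qs.execution_count,
               SUBSTRING(t.text, 1, 200) AS query_text
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
        ORDER BY qs.total_elapsed_time DESC;
    """,
}
```

Keep the bundle wherever your runbooks live; the only requirement is that every query in it is a read.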

Even with no further automation, that setup will make your worst on-call shift measurably less bad. The full architecture, including the actuator gating that keeps the AI's hands off mutation during incidents, is documented in Part 2 of The Birth of Bob.