Schema Archaeology: Letting an LLM Read 20-Year-Old Stored Procedures Faster Than You Can
Most DBAs do not spend their days designing databases. They spend their days operating databases that someone else designed, often a long time ago, often with conventions that made sense at the time and don't anymore. The technical term for understanding what you have inherited is "schema archaeology." It is the work of recovering intent from artifacts. Until recently it was a thankless multi-day exercise. With a language model in the loop, it is an afternoon.
Why archaeology gets deferred
Schema archaeology is the work that nobody assigns. It does not appear on a sprint board. It does not have a JIRA ticket. It is the silent prerequisite for every other piece of work, and it gets deferred forever because deferring it does not break anything — until it does.
The traditional approach is: open the stored procedure list, read each procedure, read the tables it touches, read the upstream procedures that call it, build a mental model, lose the mental model when you context-switch, rebuild it on Tuesday. Three days later you have a fragile understanding that evaporates the moment you take vacation.
Language models change this. Not because they understand your business. Because they are tireless readers and decent at compressing what they read.
The starter task: stored procedure summaries
The most concrete first use is generating one-paragraph summaries of every stored procedure in the database. Roughly:
- List every procedure:
```sql
SELECT s.name AS schema_name, p.name, m.definition
FROM sys.sql_modules AS m
JOIN sys.procedures AS p ON m.object_id = p.object_id
JOIN sys.schemas AS s ON p.schema_id = s.schema_id;
```
- For each definition, send it to a model with the prompt: "Summarize what this stored procedure does in three sentences. Then list the tables it reads, the tables it writes, and any non-obvious side effects."
- Save the output to a markdown file or a wiki page keyed by procedure name.
This is a couple of hours of work. The output is a database directory. Even an imperfect directory beats a database with no directory at all. Several of the first procedures you investigate this way will turn out to do something different from what their name suggests, which is itself a meaningful finding.
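The loop above is simple enough to script. A minimal Python sketch, assuming a hypothetical `ask_model` callable that wraps whatever model client you use — the function name and the exact prompt wording are illustrative, not a real API:

```python
# Build the summary prompt for one stored procedure definition and
# collect results keyed by procedure name. The model call itself is
# stubbed: ask_model is a placeholder for your actual client.

SUMMARY_PROMPT = (
    "Summarize what this stored procedure does in three sentences. "
    "Then list the tables it reads, the tables it writes, and any "
    "non-obvious side effects.\n\n{definition}"
)

def build_summary_prompt(definition: str) -> str:
    """Wrap one procedure body in the archaeology prompt."""
    return SUMMARY_PROMPT.format(definition=definition)

def summarize_procedures(procedures, ask_model):
    """procedures: iterable of (schema, name, definition) rows from the
    catalog query. ask_model: callable taking a prompt string and
    returning the model's summary. Returns {schema.name: summary}."""
    directory = {}
    for schema, name, definition in procedures:
        key = f"{schema}.{name}"
        directory[key] = ask_model(build_summary_prompt(definition))
    return directory
```

The payoff comes from where the output lands: render the dictionary into a markdown file and commit it next to the database's migration scripts, so the directory lives in version control rather than in a chat transcript.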
Building dependency lineage
Once you have the summaries, the next move is dependency lineage: which procedures call which, which procedures touch which tables, where the side effects live.
You can ask the model directly to enumerate dependencies from the procedure body. You can also use SQL Server's built-in sys.sql_expression_dependencies and feed the output into the model for human-readable annotation. The combination is more useful than either alone — the catalog gives you ground truth, the model gives you context.
The output that pays off is a "what-breaks-if-I-change-this" report for each table. When the application team asks "can we drop the LegacyOrderStatus column," you have a one-page answer instead of a week of investigation.
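The catalog side of this report is scriptable. A sketch, assuming you have already pulled (procedure, referenced table) pairs out of sys.sql_expression_dependencies — the helper names here are mine, not SQL Server's:

```python
from collections import defaultdict

def build_table_index(dependency_pairs):
    """dependency_pairs: iterable of (procedure_name, table_name) tuples,
    e.g. extracted from sys.sql_expression_dependencies.
    Returns table_name -> sorted list of procedures that reference it."""
    index = defaultdict(set)
    for proc, table in dependency_pairs:
        index[table].add(proc)
    return {table: sorted(procs) for table, procs in index.items()}

def what_breaks_if(table, index):
    """One-line impact answer for 'can we change this table?'"""
    procs = index.get(table, [])
    if not procs:
        return f"No known procedure references {table}."
    return f"{table} is referenced by {len(procs)} procedure(s): {', '.join(procs)}"
```

Note that the catalog only sees static references; procedures flagged for dynamic SQL still need a human pass before the report is trusted.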
Business rules buried in T-SQL
The hardest schema archaeology is recovering business rules that live in trigger logic, computed columns, or the inside of cursor loops. These are the kinds of rules that nobody documented because the person writing the code thought they were obvious. They are not obvious, three years later, when you are trying to figure out why the audit total never matches the source total.
Hand the model the trigger and ask: "What business rules does this trigger encode? List each as 'when X happens, then Y.'"
The output is wrong sometimes. It will occasionally invent a rule that is not in the code. You verify by reading the trigger. But it generates a starting list of rules to verify, which is faster than starting from a blank page.
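One way to keep the verification honest is to turn the model's answer into a checklist before you open the trigger. A small sketch that extracts lines of the "when X, then Y" shape from a model response — the output format is an assumption tied to the prompt above, so adjust the pattern to whatever your prompt actually elicits:

```python
import re

# Matches lines shaped like "when <condition>, then <consequence>",
# the format the prompt above asks the model to produce.
RULE_PATTERN = re.compile(r"when\s+(.+?),\s*then\s+(.+)", re.IGNORECASE)

def extract_rules(model_output: str):
    """Pull (condition, consequence) pairs out of a model response, so
    each rule can be ticked off -- or struck out as invented -- while
    reading the actual trigger body."""
    rules = []
    for line in model_output.splitlines():
        match = RULE_PATTERN.search(line)
        if match:
            rules.append((match.group(1).strip(), match.group(2).strip()))
    return rules
```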
The accuracy tradeoff is favorable
DBAs reasonably worry about LLMs hallucinating SQL behavior. The worry is real. The mitigation is the workflow, not the model.
For schema archaeology, the workflow is: model output is a draft, the DBA verifies against the actual code, the verified version goes to documentation. The model is a faster reader, not a more accurate one. The DBA is still the source of truth.
The accuracy bar to clear is not "more accurate than a human." It is "useful as a first pass." A first pass that is 80% correct and takes ten minutes per procedure beats a first pass that is 100% correct and takes ninety minutes per procedure, when the alternative is no first pass at all.
Where this stops working
LLM-based archaeology breaks down when the procedure is generated code (templated, machine-written, full of repeated boilerplate that confuses the model) or when the procedure relies on external state the model can't see (linked servers to systems with different schemas, OPENQUERY calls, dynamic SQL that builds itself from configuration tables).
For those cases, the model still helps as a co-reader, but you cannot rely on its summary without significant verification. The red flag is when the model says "this procedure does X, Y, and Z" but the body is mostly EXEC sp_executesql @sql with @sql built from a string. In those cases you have to either feed it the full execution context or do the archaeology by hand.
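You can screen for these cases mechanically before trusting any summary. A heuristic sketch — the marker list is my guess at the usual suspects, not exhaustive:

```python
# Flag procedure bodies whose behavior depends on state the model
# cannot see: dynamic SQL, linked-server access, OPENQUERY. Summaries
# of flagged procedures go to the manual-review pile.

DYNAMIC_MARKERS = ("sp_executesql", "OPENQUERY", "OPENROWSET", "EXEC(@", "EXEC (@")

def needs_manual_review(body: str) -> bool:
    lowered = body.lower()
    return any(marker.lower() in lowered for marker in DYNAMIC_MARKERS)
```

A substring check will produce false positives (markers inside comments or string literals), which is the right failure mode here: over-flagging costs a minute of reading, under-flagging costs a wrong entry in the directory.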
A practical first afternoon
If you have a database you have inherited and never properly understood, here is a half-day plan:
- Hour 1. Generate the procedure list with metadata. Filter out anything obviously generated.
- Hours 2-3. Run each through a model with the summary prompt. Use a local LLM if your schemas are sensitive (see the local-inference post for why). Save outputs.
- Hour 4. Read the summaries. Mark the surprises.
- Hour 5. For the surprises, dig in by hand. The model gave you a starting hypothesis; verify it against the actual code.
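Hour 1's "filter out anything obviously generated" can be a cheap heuristic pass before any model call. The name patterns below are assumptions about common generators (replication plumbing, CRUD scaffolding, framework installers), so inspect a sample and tune them to your shop:

```python
import re

# Illustrative guesses at machine-generated naming patterns -- verify
# against a sample of your own procedures before trusting the filter.
GENERATED_PATTERNS = [
    re.compile(r"^(MS|sp_MS)", re.IGNORECASE),   # replication plumbing
    re.compile(r"_(Insert|Update|Delete)$"),     # CRUD scaffolding
    re.compile(r"^aspnet_", re.IGNORECASE),      # framework-installed
]

def looks_generated(name: str) -> bool:
    return any(p.search(name) for p in GENERATED_PATTERNS)

def triage(names):
    """Split procedure names into (read_these, skip_these)."""
    keep = [n for n in names if not looks_generated(n)]
    skip = [n for n in names if looks_generated(n)]
    return keep, skip
```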
You will end the day with a directory you did not have at breakfast, and a shortlist of the procedures that most need a human's attention. That is enough to turn a database you inherit-and-fear into a database you inherit-and-operate.
The full reasoning architecture — sensors, reasoning, actuators, with the gates that keep an AI's hands off mutation — is in The Birth of Bob. But you can do schema archaeology without any of it. You just need a model and an afternoon.