There is a small but important class of database environments where you cannot use cloud-hosted AI without first having a long, slow conversation with your compliance officer. If your work falls under HIPAA, NERC CIP, GDPR Article 28, or a similar regime, the question of "should we use an LLM in monitoring" is downstream of "can we send query plans to an external API at all."

The honest answer to that second question, for most regulated environments, is "not without paperwork."

The compliance shape, in three flavors

HIPAA. Protected Health Information includes anything that could re-identify a patient, and that bar is low. Schema names like PatientEncounters or RxFulfillment are PHI-adjacent depending on interpretation. A query plan that references those tables is, at minimum, a vendor-review trigger. Without a Business Associate Agreement in place, sending it to an external LLM provider is a violation.

NERC CIP. Electric utilities operating bulk power systems classify configuration and operational data about grid assets as BES Cyber System Information. The standard does not call out "LLM inference APIs" by name — it predates them — but the underlying principle is unambiguous: protected information must not traverse uncontrolled paths. During audit periods, "we send this to OpenAI" is the kind of sentence that ends careers.

GDPR Article 28. Any time personal data of EU residents reaches a processor, you need a Data Processing Agreement, you need to record the processing in your Article 30 register, and you need to have done due diligence on the processor's subprocessor list. This is doable for cloud LLM vendors. It is also paperwork your DPO renews every time the vendor adds a new subprocessor or changes their data residency.

In every case, the constraint is not on the LLM itself. It is on the data that ends up in the model's context. If the data is regulated, the path to the model has to be controlled.

The seam: AI as developer tool vs. AI in the production data path

It is worth being precise about where local inference is necessary versus where cloud LLMs are fine. The line is whether regulated data ends up in the model's context.

Cloud is fine for: writing T-SQL templates, generating monitoring scripts, drafting documentation, reviewing code logic, brainstorming index strategies in the abstract. The prompts are technical, not customer data. GitHub Copilot, Claude, GPT — all daily tools.

Cloud is not fine for: analyzing a real query plan from a regulated table, summarizing a wait-stat snapshot from a production HIPAA workload, reviewing a deadlock graph that names protected tables, or any monitoring loop that runs against live data. That data needs to stay inside the network.

The same engineer can use both. The seam is what is in the prompt.

The hardware reality is unglamorous, and it works

The interesting thing about running LLMs locally is how unsexy the hardware ends up being. For SQL Server diagnostic analysis — a constrained, well-defined task — a 14B-32B parameter model on a single consumer GPU is competitive with the largest cloud models. The task does not require general-purpose intelligence. It requires reading wait-stats JSON, recognizing patterns, and producing a structured diagnostic.

A used RTX 3090 with 24 GB of VRAM will run a quantized 32B model comfortably and costs roughly the same as four months of moderate cloud LLM usage. An RTX 3060 with 12 GB will run quantized 14B models. Either is enough for the job.
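
The back-of-the-envelope math behind those pairings, assuming the roughly 4-bit quantization that Ollama's default tags use, is short enough to sketch. The constant below is an approximation, not a benchmark:

# Rough weight-footprint estimate for quantized models; numbers are illustrative.
BYTES_PER_PARAM_Q4 = 0.56  # ~4.5 bits per parameter once quantization overhead is included

for params_billions, vram_gb in [(32, 24), (14, 12)]:
    weights_gb = params_billions * BYTES_PER_PARAM_Q4
    headroom_gb = vram_gb - weights_gb
    print(f"{params_billions}B model: ~{weights_gb:.0f} GB of weights, "
          f"~{headroom_gb:.0f} GB left for KV cache and runtime on a {vram_gb} GB card")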

Ollama makes the runtime trivial

The systemd unit file is twelve lines:

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama LLM Server
After=network.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Restart=always
User=ollama
Environment="OLLAMA_HOST=0.0.0.0:11434"

[Install]
WantedBy=multi-user.target

Pull a model (ollama pull qwen2.5:32b), point your monitoring code at localhost:11434, and you have a private inference layer. No vendor agreement. No data egress to explain to an auditor. The audit answer becomes one sentence: "We process this data on our own hardware using open-weight models."
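
Pointing monitoring code at that endpoint is a single HTTP call. A minimal sketch in Python against Ollama's /api/generate API; the model tag and prompt text are placeholders for whatever your collection loop already produces:

# One diagnostic prompt in, one completion out; nothing leaves the network.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # the address the systemd unit above exposes

def ask_local_model(prompt: str, model: str = "qwen2.5:32b") -> str:
    """Send a single non-streaming generation request to the local Ollama server."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,  # large models on modest GPUs can take a while per response
    )
    resp.raise_for_status()
    return resp.json()["response"]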

What you give up — and what you don't

You give up the absolute frontier. The largest models from OpenAI, Anthropic, and Google are larger than what you can run on a consumer GPU, and on truly open-ended reasoning tasks they are still better.

For diagnostic work, you do not lose much. The task is constrained. The model is reasoning over a structured payload — a couple of dozen wait stats, five top queries, memory pressure metrics — not writing a novel. Within that envelope, a well-prompted local model performs comparably to a frontier model and stays inside the network boundary.
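
For concreteness, the envelope in question is small. An illustrative payload shape follows; the field names and values are examples for this sketch, not a schema anything depends on:

# Illustrative diagnostic payload; real collectors will differ.
snapshot = {
    "wait_stats": [  # a couple of dozen of these in practice
        {"wait_type": "PAGEIOLATCH_SH", "wait_time_ms": 182400, "pct_of_total": 41.2},
        {"wait_type": "CXPACKET", "wait_time_ms": 96100, "pct_of_total": 21.7},
    ],
    "top_queries": [  # five rows in practice
        {"query_hash": "0x9D2B41A0C3F1E877", "avg_duration_ms": 1240, "executions_per_min": 52},
    ],
    "memory": {"page_life_expectancy_sec": 312, "pending_memory_grants": 4},
}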

You also gain something cloud cannot give you: token-cost certainty. Continuous monitoring loops at one-minute intervals across multiple instances would be expensive at cloud-API rates. The marginal cost of one more inference cycle on owned hardware is electricity.

Where to start

The smallest reasonable first step for adding an LLM to a regulated monitoring stack is:

  1. Buy or repurpose a GPU host. An RTX 3060 12 GB is enough to begin.
  2. Install Ollama. Pull qwen2.5:14b or gemma2:9b.
  3. Wire your existing diagnostic queries to send their JSON output to the model with a tight prompt; a sketch of such a prompt follows this list. Get a feel for what the model gets right and wrong.
  4. Do not act on the output yet. Treat the local LLM as an interpreter for the data you already collect.
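
What "tight prompt" means in step 3 is worth pinning down: the model is confined to the collected payload and to a fixed output shape. A sketch, with the wording entirely an assumption of this example rather than anything the tooling mandates:

# Constrain the model to the collected data and a fixed output shape.
import json

PROMPT_TEMPLATE = """You are reviewing a SQL Server health snapshot.
Use ONLY the JSON below; do not invent metrics that are not present.
Answer with: (1) the single most likely bottleneck, (2) the evidence for it,
(3) one follow-up query to run to confirm. If the data is inconclusive, say so.

Snapshot:
{snapshot}
"""

def build_prompt(snapshot: dict) -> str:
    """Embed the collected diagnostics verbatim so the model reasons over real numbers only."""
    return PROMPT_TEMPLATE.format(snapshot=json.dumps(snapshot, indent=2))

The output goes to a human, not to an apply path; that is step 4.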

That alone is enough to retire the weekly 200-page report. The full architecture — sensor / reasoning / actuator separation, the confidence-gated apply path, the snapshot-before-mutate rule — is documented in Part 2 of The Birth of Bob, which walks through what it took to put one of these systems in front of a production SQL Server.