Why AI Agents Shouldn't Write SQL

AI agents are querying production databases across every company. But letting LLMs generate SQL is a reliability, security, and governance problem that context injection can't solve.

The new default is broken

Somewhere in the last eighteen months, it became normal to let AI agents query production databases. Customer support agents look up order history. Analytics agents generate revenue reports. Internal-tools agents pull data for dashboards.

The pattern is always the same: give the LLM your database schema, maybe some documentation, and let it generate SQL on the fly.

It works. Until it doesn't.

The three-agent problem

Here's a scenario that plays out in every company using AI agents for analytics:

You ask three different agents the same question: "What was our revenue last quarter?"

  • Agent A says $1.2M. It included refunds in the total.
  • Agent B says $1.4M. It counted pending transactions as revenue.
  • Agent C says $980K. It filtered to a single product line because the schema was ambiguous.

Same database. Same question. Three different SQL queries. Three different answers.

The worst part? You don't know which one is wrong. You might not even know they disagree, because each agent is running independently, each generating its own interpretation of what "revenue" means.

Why context injection doesn't fix this

The industry's answer to this problem has been to add more context. Feed the LLM your schema definitions. Inject column descriptions. Add data dictionaries. Embed documentation. Build retrieval systems that surface relevant context at query time.

This is the approach behind every "text-to-SQL" product and every RAG-based analytics system: give the model enough context, and it will generate the right SQL.

The problem is fundamental: you're making guessing better, but you're still guessing.

No matter how much context you inject, the LLM is still:

  • Interpreting what "revenue" means based on schema hints
  • Generating SQL that it believes matches the intent
  • Hoping that its JOIN logic, WHERE clauses, and aggregations are correct

Context makes the guess more educated. It doesn't eliminate the guess.

The five real risks

1. Inconsistent results

When SQL is generated on the fly, the same question can produce different queries depending on the model version, the context window contents, the temperature setting, or even the phrasing of the question. There is no guarantee of determinism.

In analytics, inconsistency is worse than being wrong. If the number is always wrong in the same way, you can calibrate for it. If it's different every time, you can't trust anything.

2. Silent failures

Generated SQL almost always runs. It produces a result set. It returns numbers. The problem is that the numbers might be subtly wrong: a missing filter, a wrong JOIN condition, an aggregation that doesn't match the business definition.

These failures are silent. The query executes successfully. The agent responds confidently. The user has no way to know the answer is wrong without independently verifying it, which defeats the purpose of having an agent.

3. Security exposure

Every generated SQL query is a potential security risk. The LLM might:

  • Access tables it shouldn't (PII, salary data, credentials)
  • Construct queries that leak information through clever JOINs
  • Generate queries that are technically read-only but expose sensitive aggregations

You can add guardrails: schema restrictions, query validation, PII detection. But you're building a security system around an unpredictable input generator. The attack surface is the entire SQL language.
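To make the point concrete, here is a minimal sketch of the kind of guardrail teams bolt on: a table allowlist plus a read-only check. Everything here (the table names, the regex) is hypothetical, and the gaps are the point.

```python
import re

# Hypothetical allowlist: the only tables the agent is supposed to touch.
ALLOWED_TABLES = {"orders", "customers"}

def is_query_allowed(sql: str) -> bool:
    """Naive guardrail: must be a SELECT, and every table after
    FROM/JOIN must be on the allowlist."""
    if not sql.lstrip().lower().startswith("select"):
        return False
    # Crude table extraction: any identifier following FROM or JOIN.
    tables = re.findall(r"\b(?:from|join)\s+([a-z_][a-z0-9_]*)",
                        sql, re.IGNORECASE)
    return all(t.lower() in ALLOWED_TABLES for t in tables)

print(is_query_allowed("SELECT SUM(total_cents) FROM orders"))   # True
print(is_query_allowed("SELECT password FROM credentials"))      # False
print(is_query_allowed("DELETE FROM orders"))                    # False
```

A validator like this already misses subqueries, CTEs, views that wrap sensitive tables, and sensitive aggregations over allowed tables. Closing every gap means parsing and policing arbitrary SQL, which is exactly the attack surface problem described above.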

4. No auditability

When an agent generates SQL, you can log the query. But you can't explain why that specific query was generated, whether it correctly represents the user's intent, or whether a different phrasing would have produced a different (and potentially more correct) query.

For regulated industries, this is a non-starter. For any company that cares about data governance, it's a serious gap.

5. No versioning

Business definitions change. "Revenue" might include a new product line starting Q3. "Active users" might be redefined from 30-day to 7-day activity. When SQL is generated on the fly, there's no way to version these definitions, no way to ensure all agents pick up the new definition simultaneously, and no way to roll back if the new definition is wrong.
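With pre-defined metrics, versioning becomes straightforward. A hypothetical sketch (the metric names, dates, and SQL are invented for illustration): each redefinition of "revenue" is a new, dated version, so changes are explicit and a rollback is just pinning an earlier version.

```python
from datetime import date

# Hypothetical version history for one metric. A redefinition is a new
# entry, not a new LLM guess.
METRIC_VERSIONS = {
    "total_revenue": [
        {"version": 1, "effective": date(2025, 1, 1),
         "sql": "SELECT SUM(total_cents)/100.0 FROM orders WHERE status = 2"},
        # Q3 change: a new product line starts counting as revenue.
        {"version": 2, "effective": date(2025, 7, 1),
         "sql": ("SELECT SUM(total_cents)/100.0 FROM orders "
                 "WHERE status = 2 AND product_line IN ('core', 'addons')")},
    ],
}

def resolve(metric: str, as_of: date) -> dict:
    """Return the latest definition effective on or before `as_of`."""
    candidates = [v for v in METRIC_VERSIONS[metric] if v["effective"] <= as_of]
    return max(candidates, key=lambda v: v["version"])

print(resolve("total_revenue", date(2025, 3, 1))["version"])  # 1
print(resolve("total_revenue", date(2025, 9, 1))["version"])  # 2
```

Every agent resolving `total_revenue` after the Q3 cutover picks up version 2 at the same moment, and reverting the change is a one-line edit to the history.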

The alternative: metrics, not queries

What if agents didn't write SQL at all?

Instead of generating queries, agents would select from a catalog of pre-defined metrics. Each metric is:

  • Defined once by the data team, in SQL they've written, tested, and reviewed
  • Versioned so changes are tracked and can be rolled back
  • Deterministic so the same metric with the same filters always returns the same result
  • Governed with PII masking, row limits, and access controls baked in

The agent's job changes from "figure out the SQL" to "pick the right metric." This is a fundamentally different problem, and it's one that LLMs are actually good at: understanding intent and mapping it to a known set of options.

# What the agent calls
query_metric("total_revenue", filters={"time_start": "2026-01-01"})

# What executes (written by the data team, not the LLM)
SELECT SUM(total_cents) / 100.0 AS revenue_usd
FROM orders
WHERE status = 2  -- status code for completed orders in this example schema
AND created_at >= '2026-01-01'

The LLM never sees the SQL. It never writes a query. It selects a metric by name and applies type-safe filters. The SQL is pre-defined, tested, and versioned.
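Here is one way "type-safe filters" could look, sketched in Python. The metric catalog, filter names, and error handling are illustrative assumptions, not a real API: the key idea is that the agent can only pass filters the metric declares, with values of the declared type.

```python
from datetime import date

# Hypothetical metric catalog: each metric declares its SQL (written by
# the data team) and the filters an agent is allowed to apply.
METRICS = {
    "total_revenue": {
        "sql": ("SELECT SUM(total_cents) / 100.0 AS revenue_usd "
                "FROM orders WHERE status = 2 AND created_at >= :time_start"),
        "filters": {"time_start": date},  # allowed filter -> expected type
    },
}

def query_metric(name: str, filters: dict):
    """Validate the agent's request against the catalog; never accept SQL."""
    metric = METRICS.get(name)
    if metric is None:
        raise KeyError(f"unknown metric: {name}")
    for key, value in filters.items():
        expected = metric["filters"].get(key)
        if expected is None:
            raise ValueError(f"filter not allowed on {name}: {key}")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    # Hand the pre-defined SQL and bound parameters to the warehouse.
    return metric["sql"], filters

sql, params = query_metric("total_revenue", {"time_start": date(2026, 1, 1)})
```

An unknown metric, an undeclared filter, or a wrongly typed value fails loudly before any SQL runs. Contrast that with generated SQL, where the same mistakes execute successfully and fail silently.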

Probabilistic vs. deterministic data access

This is the core distinction:

Probabilistic data access (context-based SQL generation) gives you a system where more context leads to better guesses, but the output is inherently non-deterministic. The same question might produce different SQL tomorrow.

Deterministic data access (metric selection) gives you a system where the same input always produces the same output. The agent selects a metric. The metric maps to tested SQL. The result is the same every time.

Don't help AI guess your data. Remove the guess.

Where AI should (and shouldn't) be involved

LLMs are exceptional at:

  • Understanding intent: "What's our revenue?" maps to the total_revenue metric
  • Reasoning over results: "Revenue is $4,298, down 12% from Q3, concentrated in the enterprise tier"
  • Conversational interaction: Asking follow-up questions, applying filters, comparing time periods

LLMs are not reliable at:

  • Writing correct SQL against production schemas
  • Understanding business definitions that aren't explicitly documented
  • Producing consistent results across different sessions and contexts

The right architecture keeps AI in the reasoning layer and removes it from the data access layer. Let the LLM do what it's good at. Don't ask it to do what it's not.

What this looks like in practice

With a deterministic data access layer, the flow becomes:

  1. User asks: "What was our revenue last quarter?"
  2. Agent understands intent, selects total_revenue metric with a time filter
  3. Metric layer executes pre-defined SQL, returns $4,298.00
  4. Agent reasons: "Revenue is $4,298, down 12% from Q3. The drop is concentrated in the enterprise tier."
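The four steps above can be run end to end. This sketch uses an in-memory SQLite database with made-up rows (the schema and the `status = 2` convention follow the earlier example); the agent's only decision is which metric to call and which filter to pass.

```python
import sqlite3

# Made-up data standing in for a production orders table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (total_cents INTEGER, status INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (250000, 2, "2026-01-05"),  # completed, in quarter
    (179800, 2, "2026-02-10"),  # completed, in quarter
    (99900,  1, "2026-02-11"),  # pending: excluded by status = 2
    (50000,  2, "2025-12-30"),  # completed, but before the quarter
])

# Step 2: the agent maps "revenue last quarter" to a metric + filter.
METRIC_SQL = {
    "total_revenue": ("SELECT SUM(total_cents) / 100.0 FROM orders "
                      "WHERE status = 2 AND created_at >= ?"),
}

# Step 3: the metric layer executes the pre-defined SQL.
(revenue,) = conn.execute(
    METRIC_SQL["total_revenue"], ("2026-01-01",)).fetchone()
print(revenue)  # 4298.0
```

Run this three times, or from three different agents, and the answer is $4,298.00 every time, because nothing between the question and the SQL is left to a model's judgment.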

The data is deterministic. The reasoning is AI-powered. Each layer does what it's best at.

Three agents asking the same question? They all get $4,298.00. Because they're all calling the same metric, which maps to the same SQL, which returns the same result.

That's the difference.


OnlyMetrix is a deterministic data access layer for AI agents. It's MCP-native, works with Snowflake, PostgreSQL, and ClickHouse, and it's in beta now. Get started.

Ready to try deterministic data access?

Define your first metric. Let agents query safely.
