Deterministic vs. Probabilistic Data Access for AI Agents

Two fundamentally different approaches to giving AI agents access to your data warehouse. One generates SQL on the fly. The other removes SQL generation entirely. Here's why it matters.

Two paradigms, one problem

Every company giving AI agents access to data is making an architectural choice, whether they realize it or not. That choice determines whether their agents produce reliable, consistent results or whether they produce answers that are right most of the time but wrong in ways that are impossible to predict.

The choice is between probabilistic data access and deterministic data access.

These aren't marketing terms. They describe fundamentally different architectures with fundamentally different reliability characteristics. Understanding the distinction is critical for anyone building production AI systems that touch real data.

Probabilistic data access

This is what most companies are building today. The architecture looks like this:

User / Agent
    |
   LLM
    |
Context Injection (schemas, docs, lineage, embeddings)
    |
Generate SQL
    |
Database
    |
Result (maybe correct)

The LLM receives the user's question, is given context about the database (schema definitions, column descriptions, documentation, example queries), and generates SQL that it believes will answer the question.

This approach has gotten remarkably good. With enough context, modern LLMs can generate correct SQL for a wide range of queries. Products in this category include text-to-SQL engines, RAG-based analytics tools, and context-enriched agent frameworks.

Why it's called probabilistic

The output is probabilistic because the SQL is generated, not selected. Given the same question:

  • Different model versions may generate different SQL
  • Different context window contents may lead to different interpretations
  • Different temperature settings produce different outputs
  • Even the same model with the same context may occasionally vary its output

The system is non-deterministic by design. More context improves the probability of correct output, but it cannot guarantee it.
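To make the variation concrete, here is a hypothetical illustration: two SQL strings a model might plausibly generate on different runs for the same question, "What was our revenue last month?". The table and column names are invented for the example. Both queries look reasonable; they are not equivalent.

```python
# Hypothetical outputs from two runs of the same text-to-SQL prompt.
# (Schema is invented for illustration.)
sql_a = (
    "SELECT SUM(total_cents) / 100.0 AS revenue_usd "
    "FROM orders "
    "WHERE created_at >= DATE '2026-01-01'"
)

sql_b = (
    "SELECT SUM(total_cents) / 100.0 AS revenue_usd "
    "FROM orders "
    "WHERE created_at >= DATE '2026-01-01' "
    "AND status = 2"  # this run remembered to filter to paid orders
)
```

The difference is semantic, not stylistic: one query includes unpaid orders in "revenue". Both runs return a number, and nothing in the output signals which definition was used.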

The context arms race

The natural response to probabilistic failures is to add more context. If the model gets "revenue" wrong, add more documentation about what revenue means. If it joins tables incorrectly, add schema relationship metadata. If it misses a filter, add business rules to the context.

This creates an arms race:

  1. Agent generates wrong SQL
  2. Team adds more context to prevent that specific error
  3. Agent generates incorrect SQL for a different query
  4. Team adds more context
  5. Repeat

Each round of context injection fixes specific failure modes while potentially introducing new ones (context conflicts, attention dilution, stale metadata). The system gets more complex without getting fundamentally more reliable.

The ceiling

There is a ceiling on how reliable probabilistic data access can be, and it's determined by the fundamental nature of language model inference. Even with perfect context, the model is still performing a generative task: constructing a SQL query character by character based on learned patterns.

For many use cases, this ceiling is high enough. Exploratory data analysis, ad-hoc questions, research queries where approximate answers are acceptable: probabilistic data access works well here.

For production systems, recurring reports, customer-facing analytics, regulated environments, or anywhere that consistency matters: the ceiling is a problem.

Deterministic data access

The alternative architecture removes SQL generation entirely:

User / Agent
    |
Select Metric
    |
Pre-defined SQL (written by data team, tested, versioned)
    |
Database
    |
Deterministic Result

The agent doesn't generate SQL. It selects from a catalog of pre-defined metrics. Each metric maps to a specific SQL query that was written by the data team, tested against known results, and versioned in source control.

Why it's called deterministic

The output is deterministic because the mapping from metric name to SQL is fixed. total_revenue always maps to the same query. The same metric with the same filters always produces the same result. There is no generation step, no probabilistic inference, no possibility of variation.

This isn't a limitation. It's the entire point.
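The fixed mapping can be pictured as a plain lookup table. A minimal sketch, with invented metric names and SQL (not a real product API):

```python
# Deterministic metric catalog: a fixed mapping from metric name to a
# parameterized SQL template. There is no generation step anywhere.
METRICS = {
    "total_revenue": (
        "SELECT SUM(total_cents) / 100.0 AS revenue_usd "
        "FROM orders WHERE status = 2 "
        "AND created_at >= %(time_start)s AND created_at < %(time_end)s"
    ),
}

def query_metric(name: str, filters: dict) -> tuple[str, dict]:
    """Resolve a metric name to its fixed SQL plus bound parameters."""
    if name not in METRICS:
        raise KeyError(f"unknown metric: {name}")
    # Same name + same filters -> same (sql, params) pair, every time.
    return METRICS[name], filters
```

Calling `query_metric("total_revenue", ...)` twice with the same filters returns byte-identical SQL both times; the only moving part is which parameter values get bound.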

What the agent actually does

In a deterministic system, the agent's job changes from "generate SQL" to "understand intent and select the right metric." This is a classification problem, not a generation problem, and it's one that LLMs handle with very high reliability.

# User: "What was our revenue last quarter?"

# Agent's job: understand intent → select metric → apply filters
query_metric(
  name="total_revenue",
  filters={"time_start": "2026-01-01", "time_end": "2026-04-01"}
)

# Result: $4,298.00 (same every time)

The agent uses its AI capabilities for what AI is good at: understanding natural language, mapping intent to options, reasoning over results. It doesn't use AI for what AI is unreliable at: writing SQL.

The role of semantic search

One concern with deterministic systems is discoverability: how does the agent know which metrics are available? This is where semantic search plays a critical role.

Instead of generating SQL from scratch, the agent searches the metric catalog by meaning:

  • "How much money are we making?" matches total_revenue
  • "Who's about to leave?" matches churn_risk
  • "Best selling products" matches top_products

The search is deterministic too: the same query always returns the same ranked list of metrics. The agent then selects from that list and applies appropriate filters.
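The ranking contract can be sketched in a few lines. Real systems typically rank by embedding similarity; the word-overlap scoring below is a stand-in chosen only to show the deterministic property (same query in, same ordered list out), and the catalog contents are invented.

```python
# Sketch of deterministic metric search over canonical questions.
CATALOG = {
    "total_revenue": ["how much money are we making", "what is our revenue"],
    "churn_risk": ["who's about to leave", "which customers might churn"],
    "top_products": ["best selling products"],
}

def search_metrics(query: str) -> list[str]:
    """Rank metrics by best word overlap with any canonical question."""
    q = set(query.lower().split())
    def score(name: str) -> float:
        return max(
            len(q & set(c.split())) / len(q | set(c.split()))
            for c in CATALOG[name]
        )
    # Sort by score descending; tie-break on name so ranking is stable.
    return sorted(CATALOG, key=lambda n: (-score(n), n))
```

Because scoring and tie-breaking are both fixed functions of the input, two agents issuing the same search always see the same ranked list.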

Comparing the two approaches

Consistency

Probabilistic: The same question may produce different SQL across sessions, agents, or model versions. Two agents asking "what's our revenue?" may return different numbers.

Deterministic: The same metric with the same filters always returns the same result. Two hundred agents asking "what's our revenue?" all get the same number.

Auditability

Probabilistic: You can log the generated SQL, but you can't explain why that specific query was chosen over alternatives. The reasoning is embedded in the model's weights.

Deterministic: Every query maps to a named metric. The SQL is visible, versioned, and reviewed. You know exactly what ran and why.

Security

Probabilistic: The attack surface is the entire SQL language. You need guardrails to prevent the model from accessing unauthorized data, constructing injection attacks, or leaking information through clever queries.

Deterministic: Agents can only execute pre-defined metrics. The attack surface is the metric catalog, which the data team controls. PII masking, row limits, and access controls are baked into each metric definition.
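That reduced attack surface can be enforced mechanically. A sketch, assuming a per-metric allow-list of filter names (the names and structure here are illustrative, not a real product API):

```python
# Agent-supplied filters are checked against a per-metric allow-list and
# then passed as bound parameters, so agent input never reaches SQL text.
ALLOWED_FILTERS = {
    "total_revenue": {"time_start", "time_end", "region"},
}

def validate_filters(metric: str, filters: dict) -> dict:
    allowed = ALLOWED_FILTERS.get(metric, set())
    unknown = set(filters) - allowed
    if unknown:
        raise ValueError(f"filters not permitted for {metric}: {sorted(unknown)}")
    return filters  # safe to bind as query parameters
```

An injection attempt like a filter named `1=1 OR` is rejected before any query is built, because it is simply not in the allow-list.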

Governance

Probabilistic: Business definitions are implicit in the context. If the definition of "revenue" changes, you need to update documentation, embeddings, and hope all agents pick up the change.

Deterministic: Business definitions are explicit in the metric SQL. Change the metric, and every agent immediately uses the new definition. Version history tracks every change.

Flexibility

Probabilistic: Highly flexible. Can answer any question that can be expressed as SQL, even novel queries the system has never seen.

Deterministic: Limited to defined metrics. If a metric doesn't exist, the agent can't answer the question. (Though it can tell the user exactly what metrics are available, which is often more useful than a potentially wrong generated answer.)

This is the real trade-off. Probabilistic systems are more flexible. Deterministic systems are more reliable. For production use cases, reliability wins.

The hybrid path

These approaches aren't mutually exclusive. A production-grade system might use:

  • Deterministic access for all standard metrics (revenue, churn, pipeline, support tickets). These are the questions that get asked repeatedly and need consistent answers.
  • Probabilistic access for exploratory, ad-hoc queries from internal data analysts who understand the caveats and can validate the output.

The key insight is that the default should be deterministic. SQL generation should be an explicit opt-in for users who understand the trade-offs, not the default path for every agent.

How to implement deterministic data access

The implementation is straightforward:

1. Define metrics

metrics:
  - name: total_revenue
    description: Total paid revenue in USD
    sql: |
      SELECT SUM(total_cents) / 100.0 AS revenue_usd
      FROM orders WHERE status = 2
    tags: [revenue, finance]
    importance: 10
    canonical_questions:
      - how much money are we making
      - what is our revenue
      - total earnings
    time_column: created_at

Each metric has a name, a human-readable description, the actual SQL, tags for categorization, and canonical questions that power semantic search.
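Loading such a definition is mostly a matter of checking required fields and filling defaults. A dependency-free sketch (the metric is shown as the equivalent Python dict rather than parsed YAML, and the validation rules are assumptions, not a documented spec):

```python
# Validate a metric definition after loading it from the catalog.
REQUIRED = {"name", "description", "sql"}

def validate_metric(metric: dict) -> dict:
    missing = REQUIRED - metric.keys()
    if missing:
        raise ValueError(f"metric is missing fields: {sorted(missing)}")
    metric.setdefault("tags", [])
    metric.setdefault("canonical_questions", [])
    return metric

total_revenue = validate_metric({
    "name": "total_revenue",
    "description": "Total paid revenue in USD",
    "sql": "SELECT SUM(total_cents) / 100.0 AS revenue_usd "
           "FROM orders WHERE status = 2",
    "tags": ["revenue", "finance"],
})
```

Validation at load time means a malformed metric fails in source control or CI, not at query time in front of an agent.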

2. Connect your warehouse

Point the metrics layer at your data warehouse: Snowflake, PostgreSQL, or ClickHouse. The metrics layer handles connection management, query execution, result formatting, and PII masking.

3. Expose via MCP

The metrics layer runs as an MCP (Model Context Protocol) server. Any MCP-compatible agent connects and gets access to the metric catalog through standard tool calls:

  • list_metrics — discover available metrics
  • search_metrics — find metrics by intent
  • query_metric — execute a metric with filters

No custom integration code. No SDK-specific bindings. Any agent that speaks MCP can use the metrics.
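The tool surface the agent sees is small. The dependency-free sketch below shows the shape of the three calls over a one-entry invented catalog; a real deployment would register these handlers with an MCP server SDK rather than call them directly.

```python
# Schematic of the three tool calls an MCP-compatible agent would use.
CATALOG = {
    "total_revenue": {
        "description": "Total paid revenue in USD",
        "sql": "SELECT SUM(total_cents) / 100.0 FROM orders WHERE status = 2",
    },
}

def list_metrics() -> list[str]:
    """Discover available metrics."""
    return sorted(CATALOG)

def search_metrics(intent: str) -> list[str]:
    """Find metrics whose description overlaps the stated intent."""
    words = set(intent.lower().split())
    return [n for n in sorted(CATALOG)
            if words & set(CATALOG[n]["description"].lower().split())]

def query_metric(name: str, filters: dict) -> dict:
    """Execute a metric: the SQL is fixed, only filters vary."""
    return {"metric": name, "sql": CATALOG[name]["sql"], "filters": filters}
```

Note that `query_metric` never accepts SQL from the agent; the only agent-controlled inputs are a metric name and a filter dict.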

When to choose which

Choose deterministic data access when:

  • Agents serve end users who can't validate SQL correctness
  • Consistency matters: the same question should always get the same answer
  • You operate in regulated industries (finance, healthcare, legal)
  • Multiple agents need to agree on business definitions
  • You need a complete audit trail of what data was accessed and why

Choose probabilistic data access when:

  • Users are data analysts who can validate generated SQL
  • Queries are exploratory and ad-hoc
  • Approximate answers are acceptable
  • The schema is simple enough that generation is reliable
  • Speed of development matters more than consistency

The bottom line

The industry has defaulted to probabilistic data access because it's easier to build and demo. Give a model your schema, watch it write SQL, ship it.

But "it works in the demo" is not the same as "it works in production." Production data access needs consistency, auditability, security, and governance. These properties are inherent to deterministic systems and bolted-on (at best) to probabilistic ones.

If you're building AI agents that touch production data, the question isn't whether your context injection is good enough. The question is whether you should be generating SQL at all.

Don't help AI guess your data. Remove the guess.

OnlyMetrix implements deterministic data access for AI agents. MCP-native, works with Snowflake, PostgreSQL, and ClickHouse. Currently in beta. Try it.

Ready to try deterministic data access?

Define your first metric. Let agents query safely.

Get Started