We Ran the Same Benchmark as dbt Labs. Here's What They Didn't Measure.

20 analytics questions. 60 agent calls. Two strategies: SQL generation vs. governed metrics. The results tell a story about fabrication, not accuracy.

The context

In March 2026, Claire Gouze published the most honest account of making a semantic layer work for analytics agents. She spent four weeks iterating on a dbt MetricFlow setup, moving from 13% to 82% reliability. Four iterations. Real production data. Each step required changes to the data model, not the prompt.

Then dbt Labs published their own benchmark. Using their dbt-llm-sl-bench framework and the ACME Insurance dataset (43 questions about policies, claims, and loss ratios), they tested three strategies: raw text-to-SQL, their Semantic Layer, and their MCP server. The Semantic Layer scored highest. The results looked definitive.

Two experiments. Wildly different results. The difference: dbt Labs built their dbt models specifically to answer their benchmark questions on a controlled dataset. Claire used real production data.

But neither experiment measured the most important question for production use: when the agent gets it wrong, does it tell you, or does it hand you a plausible wrong number?

That is what we set out to measure.

Our methodology

We used the same scoring methodology as both Claire and dbt Labs: exact data diff. The agent's query result is compared row-by-row against gold SQL output. No LLM judge. No semantic matching. Correct means exact match. Wrong means wrong.

Dataset

We deliberately did not use the ACME Insurance dataset from the dbt Labs benchmark. Instead, we used the UCI Online Retail dataset: real e-commerce transaction data with roughly 500K line items across four normalized tables (invoices, customers, products, invoice_items). This is the same dataset OnlyMetrix runs against in production testing.

Why a different dataset? The ACME Insurance benchmark's dbt models were built to answer its 43 test questions. The models and the questions were designed together. Our questions were written before we built the metric layer. No circular advantage. The conditions are harder, and the results are more representative of what happens when a real team deploys a semantic layer against data that was not shaped for a benchmark.

Questions

20 business questions, each run 3 times, for 60 agent calls per strategy. The questions span the range of what analytics teams actually ask:

  • Simple KPI lookups: total revenue, customer count, invoice count
  • Dimension breakdowns: revenue by country, customers by country
  • Time series: monthly revenue, quarterly revenue
  • Derived calculations: revenue per customer, average items per order
  • Filtered aggregations: cancellation rate, churn rate, negative invoice count
  • Multi-table joins: top 10 products by revenue

Strategies

Two strategies tested side by side, same model for both:

SQL baseline: Claude Sonnet 4.6 receives the full database schema (table names, column types, row counts) and generates SQL from scratch for each question. This is the standard text-to-SQL approach.

OM agent: Claude Sonnet 4.6 with access to OnlyMetrix's compiled metric catalog. Instead of writing SQL, the agent calls query_metric with a metric name and optional filters. The SQL backing each metric was written and tested by a human. The LLM never sees it.

The metric they didn't track

We added one measurement that neither Claire's experiment nor dbt Labs' benchmark included: fabrication count.

A fabricated answer is not an error. It is not a refusal. It is the agent returning a plausible number that is factually wrong. The query runs. The response looks confident. The number is incorrect. You have no way to know without checking the answer yourself.

This is the metric that matters most in production.

First results: 50% vs. 65%

We ran the first benchmark on April 10, 2026. The OM agent lost.

StrategyReliabilityCoverageFabrications
SQL baseline (Sonnet)65%100%7
OM agent (Sonnet)50%95%0

The SQL baseline answered every question. Roughly a third of those answers were wrong. The wrong answers looked plausible. The SQL ran without errors. The numbers were formatted correctly. There was no signal that anything was off. Seven fabricated answers out of twenty.

The OM agent scored lower on reliability. Ten questions correct out of twenty. But here is where it gets interesting: the ten failures were not fabrications. Every single failure was one of three things: a metric that did not exist, a backend error the agent could not work around, or the agent selecting a similar but wrong metric. Zero fabricated answers.

The failure modes looked like this:

QuestionAgent pickedFailure reason
How many orders were cancelled?invoices_countNo cancellation metric existed
What percentage are cancellations?invoices_countNo cancellation rate metric
Avg items per order?avg_amountWrong metric (money, not items)
How many unique products sold?(none)No metric existed
Largest single order?(none)No metric existed
Avg orders per customer?(none)No metric existed
Revenue per customer?total_revenueWrong metric (total, not per-customer)
Top 10 products by revenue?top_productsData extraction mismatch
Customers per country?customers_countNo per-country metric
Negative invoice count?invoices_countNo filtered metric existed

Every failure was diagnostic. The agent either said "I don't have a metric for this" or it tried the closest match and got a wrong answer that was traceable to a specific root cause. The SQL baseline's failures were invisible. There was no way to tell which answers were the fabricated seven without already knowing the correct answer.

OM failed loudly. SQL failed silently.

Diagnosing the failures

We found three root causes. Each one is concrete and fixable.

Root cause 1: the GROUP BY rewrite bug

This was the most important finding, and it had nothing to do with LLMs.

Every dimension query was returning HTTP 400. Not because the dimension did not exist. Not because the agent picked the wrong metric. Because of a bug in the SQL rewriter.

When a user queries revenue_by_country with dimension=invoice_date, the backend needs to rewrite the SQL to group by the new dimension. What it was doing instead was appending the new GROUP BY column without removing the old one:

-- What the rewriter produced (broken):
SELECT country, SUM(total_amount)
FROM invoices
GROUP BY country, invoice_date
-- PostgreSQL error: "country" must appear in GROUP BY
-- or be used in an aggregate function

-- What it should have produced:
SELECT invoice_date, SUM(total_amount)
FROM invoices
GROUP BY invoice_date

The fix required walking the SQL AST, classifying each SELECT item as aggregate (SUM, AVG, COUNT) or bare column, keeping the aggregates, and replacing bare columns with the new dimension. Six unit tests written, all failed before the fix, all passed after. This is the kind of bug that benchmarks are supposed to find.

Root cause 2: missing metrics

Six questions had no corresponding metric in the catalog:

  • Cancelled order count (needed COUNT WHERE is_cancellation)
  • Cancellation rate (needed filtered COUNT / total COUNT)
  • Average items per order (needed AVG over a subquery)
  • Largest single order (needed MAX(total_amount))
  • Revenue per customer (needed total/customer ratio)
  • Unique products sold (needed COUNT DISTINCT)

The fix: add these as explicit metrics. Write the SQL, add it to the config, let the compiler discover the structure. No MetricFlow YAML. No hand-written semantic layer definitions. Just SQL and omx.

This is the same finding Claire had in her step 4. She called it "enriching the data model." We found it in one benchmark run instead of four weeks, because the failure table told us exactly which metrics were missing.

Root cause 3: wrong metric selection

Three questions had the agent pick a metric with a similar name but the wrong aggregation type. "Average items per order" mapped to avg_amount (average invoice dollar amount) instead of an items-per-order metric. "Revenue per customer" mapped to total_revenue instead of a per-customer average.

The fix: improve metric descriptions. avg_amount: "Average invoice amount in GBP" disambiguates clearly from items-per-order when the description says what unit the metric measures. One afternoon of work, not weeks.

Final results: 100% vs. 70%

After fixing all three root causes, we reran the benchmark on April 13, 2026. Same 20 questions, same 3 iterations per question, same model.

StrategyReliabilityCoverageFabrications
SQL baseline (Sonnet 4.6)70%100%18
OM agent (Sonnet 4.6)100%100%0

The OM agent went from 50% to 100%. Not by changing the model. Not by changing the prompt. Not by changing the agent architecture. By fixing the data layer underneath it: patching a SQL rewrite bug, adding ten missing metrics, improving three descriptions, and aligning one grouping key.

60 out of 60 agent calls returned the correct answer. Every question. Every iteration. The last fix that took us from 95% to 100% was a grouping key mismatch: the top_products metric grouped by (description, stock_code) while the gold SQL grouped by description only, producing different revenue totals for products with multiple SKUs. The fix: redefine the metric to group by description only, matching the business question ("top products by name") not the table structure ("top SKUs"). The benchmark caught the wrong decision immediately.

The SQL baseline told a different story. It improved from 65% to 70% between the two runs (different model versions), but fabricated 18 answers across the 60 calls. The same six questions failed every single time:

  • avg_order_value: wrong aggregation logic, 0/3
  • monthly_revenue: wrong date truncation, 0/3
  • top_products_by_revenue: wrong JOIN logic, 0/3
  • revenue_by_quarter: wrong date grouping, 0/3
  • revenue_per_customer: wrong subquery structure, 0/3
  • orders_per_customer: wrong aggregation level, 0/3

All 18 fabricated answers ran without errors. All returned plausible numbers. All were wrong.

What this means

Three conclusions.

Fabrication matters more than accuracy

SQL at 70% sounds usable. SQL with 18 fabricated answers is dangerous. You do not know which 30% to distrust. Every answer looks the same: a query ran, numbers came back. The wrong ones are indistinguishable from the right ones without ground truth.

OM at 100% with zero fabrications means something different. When it answers, trust it. When it cannot answer, it tells you. The failure mode is visible, not hidden. In production, the difference between "sometimes wrong" and "sometimes silent" is the difference between a tool and a liability.

Agent reliability is a data layer problem

We went from 50% to 100% without changing the model, the prompt, or the agent architecture. We fixed a SQL bug, added ten metrics, wrote better descriptions, and aligned a grouping key. The agent got smarter because the layer underneath it got more correct.

This matches Claire's finding. She spent four weeks on the same insight: the path to reliable agents goes through the data model, not through prompt engineering. We arrived at the same conclusion in one benchmark run because the compiler told us exactly which metrics were missing and the failure table showed us which ones the agent was confusing.

The implication for teams building analytics agents: stop tuning prompts. Fix your data layer. Define the metrics the agent needs. Make the descriptions unambiguous. The agent will do the rest.

Best-case benchmarks hide real-world failure modes

The dbt Labs benchmark tested 43 insurance domain questions against ACME Insurance, with dbt models built specifically for those questions. That measures what is possible under ideal conditions. It does not tell you what happens when the metric catalog has gaps, when descriptions are ambiguous, or when the backend has bugs.

We tested 20 general business questions against real e-commerce data, with questions written before the metric layer existed. Our first run at 50% is a more realistic starting point for most teams. The path from 50% to 100% is clear and mechanical: close coverage gaps, fix backend bugs, improve descriptions, align grouping keys. But you have to run the benchmark to find those gaps.

The fabrication metric is what separates benchmarks from production readiness. A system that scores 70% with hidden fabrications is less useful than a system that scores 50% with honest failures. Honest failures are fixable. Silent fabrications erode trust in ways that are hard to recover from.

Reproduce it

The full benchmark is open source:

git clone https://github.com/dreynow/omx-benchmark
cd omx-benchmark
pip install onlymetrix httpx pyyaml anthropic
python run_bench.py --strategies sql omx_agent \
  --model claude-sonnet-4-6 --iterations 3

The repo includes the 20 questions, gold SQL for each, the scoring logic (exact data diff with 1% numeric tolerance), and the full benchmark harness. The questions are generic enough to adapt to your own dataset. The scoring is deterministic.

Run it against your warehouse. Swap in your own metrics. See where the gaps are. The failure table will tell you exactly what to fix.

Claire posted her results. dbt Labs posted theirs. This is a third data point in that conversation. The question is not whether semantic layers work. The question is whether your semantic layer is complete enough, and whether you are measuring the right thing.

Accuracy tells you how often the agent is right. Fabrication count tells you how much you can trust it.


OnlyMetrix is open source. pip install onlymetrix.

Ready to try deterministic data access?

Define your first metric. Let agents query safely.

Get Started