The conversation has shifted
For the last year, the AI engineering community has been focused on context engineering: the practice of designing what information an LLM sees at inference time. Better prompts, better RAG pipelines, better schema injection. The assumption was that if you give the model enough context, it will produce the right output.
That assumption is now being challenged by a more fundamental question: what happens when the model gets it wrong despite having perfect context?
This is where harness engineering enters the picture. Coined by Birgitta Boeckeler on Martin Fowler's site and championed by engineers like Mitchell Hashimoto, harness engineering is the system design of everything that sits outside the model: behavioral constraints, feedback loops, quality gates, and improvement cycles that ensure reliability across thousands of inferences, not just one.
Context engineering asks: what do we show the agent?
Harness engineering asks: what do we prevent, measure, control, and fix?
Both matter. But for data access specifically, the harness is everything.
Why data access is different
Most AI agent tasks have a wide solution space. There are many acceptable ways to write an email, summarize a document, or plan a project. Context engineering works well here because "good enough" is genuinely good enough.
Data access is fundamentally different. When someone asks "what was our revenue last quarter?" there is exactly one correct answer. The SQL that produces that answer has one correct form, or at most a small set of equivalent ones. The margin for error is zero.
This makes data access a worst-case scenario for context engineering and a best-case scenario for harness engineering.
With context engineering alone, you are optimizing the probability that a generative model will produce the one correct SQL query out of the infinite space of possible SQL queries. You can get that probability very high. You cannot get it to 100%.
With harness engineering, you can constrain the system so that the correct query is the only query that can run. That is a fundamentally different reliability profile.
What harness engineering looks like for data access
Mitchell Hashimoto described his approach to harness engineering as a simple principle: "anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again."
For his Ghostty project, this means an AGENTS.md file where each line corresponds to one corrected behavior. Simple, effective, and iterative.
For data access, you can take this principle much further. Instead of correcting mistakes one by one, you can eliminate the entire class of mistake by removing the generative step entirely.
Here is what a fully harnessed data access layer looks like:
Behavioral constraints
The agent cannot generate SQL. It can only select from a catalog of pre-defined metrics. Each metric maps to a specific SQL query written by the data team, tested against known results, and versioned in source control.
```yaml
metrics:
  - name: total_revenue
    description: Total paid revenue in USD
    sql: |
      SELECT SUM(total_cents) / 100.0 AS revenue_usd
      FROM orders WHERE status = 2
    tags: [revenue, finance]
    importance: 10
```

The agent's job changes from "generate SQL" to "select the right metric." This is a classification problem, not a generation problem, and LLMs handle classification with much higher reliability than generation.
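A minimal sketch of what this constraint looks like at the tool boundary. The catalog contents and function name here are illustrative, not OnlyMetrix's actual API: the point is that an unknown or invented metric name is rejected deterministically before any SQL can run.

```python
# Illustrative sketch: the agent can only name a metric; the harness
# resolves that name against a fixed catalog. There is no code path
# from agent output to arbitrary SQL.

CATALOG = {
    "total_revenue": (
        "SELECT SUM(total_cents) / 100.0 AS revenue_usd "
        "FROM orders WHERE status = 2"
    ),
}

def resolve_metric(name: str) -> str:
    """Map an agent-selected metric name to its pre-defined, reviewed SQL."""
    if name not in CATALOG:
        # The failure mode is a refusal, not a hallucinated query.
        raise ValueError(f"Unknown metric: {name!r}")
    return CATALOG[name]
```

Note that the error path is the interesting part: a misclassification surfaces as an explicit rejection the harness can log, rather than a plausible-looking query that silently returns the wrong number.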
Quality gates
Every metric goes through a review process before it enters the catalog. The SQL is written by someone who understands the data model. It is tested against known results. It is reviewed by the team. Only then does it become available to agents.
This is the same quality gate that software engineering applies to production code: code review, testing, and controlled deployment. The only difference is that the "code" is metric definitions instead of application logic.
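One way to make "tested against known results" concrete is a regression test that runs the metric's SQL against a small fixture database and pins the expected answer. This is an illustrative sketch using SQLite as a stand-in warehouse; table names and values are invented for the example.

```python
# Illustrative quality gate: a metric definition must reproduce a known
# result against fixture data before it enters the catalog.
import sqlite3

TOTAL_REVENUE_SQL = """
SELECT SUM(total_cents) / 100.0 AS revenue_usd
FROM orders WHERE status = 2
"""

def check_total_revenue() -> float:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (total_cents INTEGER, status INTEGER)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(1000, 2), (2500, 2), (9999, 1)],  # status 1 is unpaid, excluded
    )
    (revenue,) = conn.execute(TOTAL_REVENUE_SQL).fetchone()
    return revenue  # only the two paid orders count: 35.0
```

A check like this runs in CI on every change to a metric file, which is what turns the catalog into reviewable, testable code rather than a pile of query strings.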
Deterministic linting
Birgitta Boeckeler specifically calls out "deterministic custom linters and structural tests" as a component of harness engineering. For data access, this translates to:
- PII masking: Every query result is scanned for personally identifiable information. Email addresses, phone numbers, and names are masked before the agent sees them.
- Schema restrictions: Agents can only access approved schemas. Salary tables, credentials, and internal tools databases are invisible.
- Read-only enforcement: Every query is validated as a SELECT statement. INSERT, UPDATE, DELETE, and DDL statements are rejected before execution.
- Row limits: Results are capped to prevent accidental data dumps.
- SQL validation: The SQL is parsed and validated before execution, even though it was pre-defined. Belt and suspenders.
These are deterministic checks, not probabilistic ones. They do not depend on model behavior. They run the same way every time.
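Two of these checks can be sketched in a few lines. This is a simplified illustration, not production code: a real harness would use a full SQL parser and a broader PII ruleset, but the deterministic shape is the same.

```python
# Illustrative deterministic checks: read-only enforcement and email
# masking. Both run the same way every time, independent of the model.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def enforce_read_only(sql: str) -> None:
    """Reject anything that is not a single SELECT statement."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:
        raise ValueError("Multiple statements are not allowed")
    if not stripped.upper().startswith("SELECT"):
        raise ValueError("Only SELECT statements may run")

def mask_pii(value: str) -> str:
    """Mask email addresses before results reach the agent."""
    return EMAIL_RE.sub("***@***", value)
```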
Audit and feedback loops
Every metric query is logged with full attribution: which agent called it, which user triggered it, what filters were applied, how long it took, and whether it succeeded. This creates a complete audit trail that answers two critical questions:
- What data did the agent access? (compliance)
- Which metrics are agents actually using? (product insight)
The feedback loop closes when the data team reviews query patterns and adds new metrics based on what agents (and their users) are actually asking for. This is Hashimoto's principle applied systematically: observe the failure, engineer the solution, prevent recurrence.
Garbage collection
Boeckeler describes "agents that run periodically to find inconsistencies in documentation or violations of architectural constraints." For a data access harness, this means:
- Detecting metrics whose underlying tables have changed
- Flagging metrics that reference columns that no longer exist
- Identifying metrics that haven't been queried in 90 days (candidates for deprecation)
- Validating that metric SQL still produces results against the current schema
These checks run on a schedule, independent of any agent interaction. They maintain the health of the harness itself.
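Two of these scheduled checks can be sketched with plain data structures. This is an illustrative sketch under simplifying assumptions: metric-to-column mappings and last-queried dates are passed in as dictionaries, where a real system would derive them from the catalog and the audit log.

```python
# Illustrative garbage-collection checks for the metric catalog,
# run on a schedule rather than per agent interaction.
from datetime import date, timedelta

def stale_metrics(last_queried: dict, today: date, max_age_days: int = 90):
    """Flag metrics not queried within the retention window."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, last in last_queried.items() if last < cutoff)

def broken_metrics(metric_columns: dict, schema_columns: set):
    """Flag metrics referencing columns that no longer exist."""
    return sorted(name for name, cols in metric_columns.items()
                  if not cols <= schema_columns)
```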
Context engineering is still necessary
To be clear: harness engineering does not replace context engineering. It contains it.
In a harnessed data access system, context engineering still plays a role:
- Metric discovery: The agent needs context to understand which metric matches the user's intent. Semantic search over metric descriptions, canonical questions, and tags provides this context.
- Filter application: The agent needs to understand the user's time range, grouping preferences, and filter values. This is context interpretation.
- Result reasoning: After receiving deterministic data, the agent uses its full context to interpret the results. "Revenue is $4,298, down 12% from Q3, concentrated in the enterprise tier." This is where AI reasoning shines.
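Metric discovery can be sketched in miniature. A real system would use embedding-based semantic search over descriptions and canonical questions; this illustrative version substitutes simple word overlap to keep the sketch self-contained, and the catalog entries are invented for the example.

```python
# Illustrative metric discovery: score catalog entries against the
# user's question. Real systems would use embeddings; word overlap
# keeps the sketch dependency-free.

def discover_metric(question: str, catalog: dict) -> str:
    """Pick the metric whose description and tags best match the question."""
    words = set(question.lower().split())
    def score(entry):
        text = entry["description"].lower().split() + entry["tags"]
        return len(words & set(text))
    return max(catalog, key=lambda name: score(catalog[name]))
```

Even in a harnessed system, this step stays probabilistic: the harness guarantees that whichever metric is chosen executes correctly, while context engineering improves the odds that it is the right one.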
The key distinction is where each discipline operates:
- Context engineering operates at the reasoning layer (intent understanding, result interpretation)
- Harness engineering operates at the data access layer (constraints, validation, execution)
Problems arise when context engineering is applied to the data access layer, where deterministic constraints would be more appropriate. You would not use prompt engineering to enforce authentication in a web application. You should not use context engineering to enforce data governance in an agent system.
The production gap
There is a well-documented gap between AI demos and AI production systems. Agents that work perfectly in demonstrations fail 20% to 40% of the time in production. This gap is not a context problem. It is a harness problem.
For data access specifically, the production gap manifests as:
- Inconsistent results: The same question returns different numbers on different days because the generated SQL varies.
- Silent failures: The query runs successfully and returns a number, but the number is wrong because of a subtle JOIN error or missing filter.
- Security incidents: The agent accesses data it should not have, because the generated SQL navigated around schema restrictions.
- Ungovernable definitions: Three agents define "revenue" three different ways because each generates its own SQL from the same schema.
Every one of these failures can be eliminated by harness engineering. Not reduced. Eliminated.
When agents can only select pre-defined metrics, results are always consistent. When SQL is pre-written and tested, there are no silent failures. When access is constrained to a metric catalog, there are no unauthorized queries. When definitions live in versioned metric files, every agent uses the same definition.
The Hashimoto test
Mitchell Hashimoto's principle provides a useful test for any AI system: when the agent makes a mistake, can you engineer a solution that prevents it from ever happening again?
For context-engineered data access, the answer is usually "sort of." You can add more context, more examples, more constraints to the prompt. But the fix is probabilistic. It reduces the likelihood of the mistake without eliminating it.
For harness-engineered data access, the answer is "yes, completely." The mistake cannot recur because the system does not have the capability to make it. An agent that can only call query_metric("total_revenue") cannot produce an incorrect SQL JOIN. The failure mode does not exist.
This is the difference between reducing risk and removing risk. For production data access, removal is the right standard.
Practical implementation
Building a data access harness does not require building everything from scratch. The core components are:
- A metric catalog: YAML or code definitions of every metric the agent can access, with the SQL, tags, filters, and canonical questions for each.
- A query execution layer: Connects to your warehouse (Snowflake, PostgreSQL, ClickHouse), executes the pre-defined SQL, applies PII masking, enforces row limits.
- An agent interface: Exposes the catalog via MCP (Model Context Protocol) so any compatible agent can discover and query metrics through standard tool calls.
- An audit system: Logs every query with attribution, timing, and results for compliance and feedback loops.
```
# What the agent calls (classification, not generation)
query_metric("total_revenue", filters={"time_start": "2026-01-01"})

# What executes (pre-defined, tested, versioned)
SELECT SUM(total_cents) / 100.0 AS revenue_usd
FROM orders WHERE status = 2 AND created_at >= '2026-01-01'

# What the agent returns (AI reasoning on deterministic data)
"Revenue is $4,298 this quarter, down 12% from Q3."
```

The AI is in the loop for reasoning. It is out of the loop for data access. Each layer does what it does best.
The bottom line
The context engineering conversation was an important step forward. It moved the industry beyond naive prompt engineering toward thoughtful information design.
The harness engineering conversation is the next step. It moves beyond optimizing model inputs toward designing systems that are reliable regardless of model behavior.
For data access, the harness is not optional. Your agents are querying production databases that power real business decisions. The data they return affects revenue calculations, customer communications, compliance reports, and strategic planning.
Context engineering makes the model smarter. Harness engineering makes the system reliable. For data access, you need both, but if you had to choose, choose the harness.
Don't help AI guess your data. Remove the guess.
OnlyMetrix is a data access harness for AI agents. Pre-defined metrics, deterministic execution, full audit trail. MCP-native, works with Snowflake, PostgreSQL, and ClickHouse. Try the beta.