Metrics That Improve Overnight

+7.4% F1 improvement. 29 variations tested. Under a second. Your metrics get better while you sleep.

The Autoresearch loop

Autoresearch treats metric definitions as hypotheses. It generates variations, scores them against ground truth, and suggests the best one for human approval.

1. Define with ground truth

Start with a metric definition and a labeled dataset. "These 340 customers churned. This is the ground truth."
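
A minimal sketch of those two inputs. The MetricDefinition shape and its field names are hypothetical, for illustration only, not Autoresearch's actual schema:

```python
# Illustrative shapes only -- not Autoresearch's actual API.
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    name: str
    inactivity_days: int   # flag customers with no purchase in the last N days
    window_days: int       # how far back to aggregate purchase activity

baseline = MetricDefinition(name="churned", inactivity_days=90, window_days=365)

# Ground truth: the 340 customers known to have churned.
ground_truth: set[str] = {"cust_0012", "cust_0047", "cust_0093"}  # ...340 ids in full
```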

2. Generate filter variations

Autoresearch generates dozens of filter variations from the metric's intermediate representation (IR). Different thresholds, different date ranges, different aggregation windows, all systematically explored.
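
Continuing the sketch above, a grid sweep is one plausible shape for that generation step; the knobs and their values here are illustrative, not the real IR:

```python
from itertools import product

def generate_variations(base: MetricDefinition) -> list[MetricDefinition]:
    """Systematically sweep thresholds and aggregation windows around the baseline."""
    thresholds = [30, 45, 60, 75, 90, 120, 150]   # days of inactivity
    windows = [180, 270, 365, 540]                # days of history to aggregate
    return [
        MetricDefinition(base.name, t, w)
        for t, w in product(thresholds, windows)
        if (t, w) != (base.inactivity_days, base.window_days)
    ]
```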

3. Score against ground truth

Each variation is scored on precision, recall, and F1 against the labeled data. No opinions, just measurements.
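
The scoring itself is plain set arithmetic against the labels. A sketch:

```python
def score(flagged: set[str], truth: set[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 of a variation's flagged set against ground truth."""
    tp = len(flagged & truth)                           # true positives
    precision = tp / len(flagged) if flagged else 0.0   # flagged that really churned
    recall = tp / len(truth) if truth else 0.0          # churners that got flagged
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```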

4. Map the Pareto frontier

The results map to a Pareto frontier: the optimal tradeoffs between precision and recall. You see the landscape of possibilities, not just one answer.
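
Computing that frontier is a standard dominance filter over the (precision, recall) pairs, roughly:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep the (precision, recall) points that no other point dominates."""
    frontier: list[tuple[float, float]] = []
    for p, r in sorted(points, reverse=True):    # walk precision descending
        if not frontier or r > frontier[-1][1]:  # keep only if recall improves
            frontier.append((p, r))
    return frontier
```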

5. Suggest for approval

The best variation is suggested for human review. You approve it, reject it, or refine the ground truth. The metric improves. The loop continues.

29 variations. Under a second.

Real numbers from a retail transaction dataset. Not synthetic benchmarks.

Baseline F1: 0.660
Customers flagged: 2,696 of 5,942
Precision: 0.701
Recall: 0.625

The Karpathy Loop

Metrics are neural nets. They take inputs (your data), apply transformations (filters, aggregations), and produce outputs (numbers your business acts on). The question is: are they trained?

Autoresearch is the training loop. Define the loss (F1 against ground truth). Generate variations (hyperparameter search). Score them (evaluation). Pick the best one (model selection). Your metrics get better the same way models get better: through systematic iteration, not guesswork.
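
A compressed sketch of one pass of that loop, reusing the illustrative helpers above. Here evaluate and the labels are synthetic stand-ins; the real system runs the metric IR against your actual transactions:

```python
import random

customers = [f"cust_{i:04d}" for i in range(5942)]
truth = set(random.Random(0).sample(customers, 340))  # synthetic stand-in for labels

def evaluate(v: MetricDefinition) -> set[str]:
    """Synthetic stand-in: flags a deterministic pseudo-random subset per variation."""
    rng = random.Random(v.inactivity_days * 10007 + v.window_days)
    return set(rng.sample(customers, 15 * v.inactivity_days))

# Generate, score, select; a human approves the winner.
candidates = [baseline] + generate_variations(baseline)
scored = [(v, *score(evaluate(v), truth)) for v in candidates]
best = max(scored, key=lambda row: row[3])            # model selection on F1
print(f"suggest for approval: {best[0]} (F1={best[3]:.3f})")
```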

Let your metrics improve themselves

Define ground truth. Autoresearch does the rest.