Frontier long-horizon evals

Evals for frontier models.

Idler builds the tasks that train and measure frontier models, drawn from real work in coding, finance, science, and defense. Every task is fair, long-horizon, and graded against ground truth.

The record

A neutral record of what models can do.

Every task is a real problem with a checkable outcome. A model works through it step by step, and each step is graded against ground truth, so the same task yields the same score for every lab.

rollout · finance/reconciliation-117
# the model attempts a real task; every step is graded
step 01  read_ledger(q3)            pass
step 02  match_invoices()           pass
step 03  flag_discrepancy(#4471)    fail  // off by 0.02
step 04  reconcile(adjustment)      pass
reward = 0.74   graded against the closing balance
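A minimal sketch of how a step-graded rollout like the one above could be scored. Everything here is an assumption for illustration: the names StepResult, grade_step, and pass_fraction are hypothetical, the amounts are invented except for the 0.02 gap on step 03, and the aggregation is a plain pass fraction. The page does not say how its reward is weighted, so this sketch gives 0.75 rather than the 0.74 shown.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class StepResult:
    name: str
    passed: bool
    error: Decimal  # absolute gap between produced value and ground truth


def grade_step(name: str, produced: str, truth: str) -> StepResult:
    """Pass only when the produced amount matches ground truth exactly.

    Decimal avoids binary floating-point drift on currency amounts.
    """
    error = abs(Decimal(produced) - Decimal(truth))
    return StepResult(name, error == 0, error)


def pass_fraction(results: list[StepResult]) -> float:
    """Assumed aggregation: share of steps that passed (not Idler's formula)."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Hypothetical amounts; only the 0.02 miss on step 03 mirrors the rollout above.
rollout = [
    grade_step("read_ledger", "1200.00", "1200.00"),
    grade_step("match_invoices", "845.10", "845.10"),
    grade_step("flag_discrepancy", "310.52", "310.50"),
    grade_step("reconcile", "2355.60", "2355.60"),
]
score = pass_fraction(rollout)

for r in rollout:
    print(f"{r.name:<18} {'pass' if r.passed else 'fail'}  error={r.error}")
print(f"reward = {score:.2f}")
```

Because the grader compares against fixed ground truth and has no randomness, the same rollout always produces the same number, which is the property the record depends on.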

Why Idler

Tasks worth training on.

Sourced from the real world

Drawn from real engineering and operations, not invented, so the skills a model learns transfer to real work.

Exhaustive quality control

Every task tested, every failure mode expert-verified, every accepted answer truly correct.

Calibrated to your difficulty

Stratified by type and graded step by step, so the signal stays dense and informative.

Method

From a skill to a graded task.

01 Perceive: Pick a capability and find where frontier models break on real work.
02 Represent: Turn it into a task with a checkable result and a verifiable rubric.
03 Build: Stand up the environment, the tools, and the grader, clean of contamination.
04 Scale: Mass-produce variants across the distribution. Early tasks become training data.
05 Measure: Score where models fail, then aim the next task at exactly that.
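The Measure step can be sketched as a small loop over graded results: count failures per skill, then target the weakest one. This is a hedged illustration only. The functions failure_rates and next_target, the (skill, passed) record shape, and the sample data are all hypothetical and are not taken from Idler's tooling.

```python
from collections import defaultdict


def failure_rates(records):
    """Map each skill to its failure rate, given (skill, passed) pairs."""
    totals = defaultdict(int)
    fails = defaultdict(int)
    for skill, passed in records:
        totals[skill] += 1
        if not passed:
            fails[skill] += 1
    return {skill: fails[skill] / totals[skill] for skill in totals}


def next_target(records):
    """Return the skill with the highest failure rate, or None if no data."""
    rates = failure_rates(records)
    if not rates:
        return None
    return max(rates, key=rates.get)


# Hypothetical graded steps collected across several rollouts.
records = [
    ("read_ledger", True), ("read_ledger", True),
    ("match_invoices", True), ("match_invoices", False),
    ("flag_discrepancy", False), ("flag_discrepancy", False),
    ("reconcile", True), ("reconcile", True),
]
target = next_target(records)
print(target)
```

A real pipeline would likely weight by sample size and difficulty before choosing a target; the sketch keeps only the core idea of aiming the next task at the observed failure.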

Domains

Where the tasks come from.

Coding

Real bugs, features, and refactors in live repos.

Finance

Reconciliation, modeling, and long-horizon analysis.

Science

Bio, pharma, and research workflows.

Defense

High-stakes capability and stress-testing.

Tell us where your models fail. We build the world that trains them.