Idler builds the tasks that train and measure frontier models, drawn from real work in coding, finance, science, and defense. Fair, long-horizon, and graded against ground truth.
Every task is a real problem with a checkable outcome. A model works through it step by step, and each step is graded against ground truth, so the same task yields the same score for every lab.
# the model attempts a real task; every step is graded
step 01  read_ledger(q3)           pass
step 02  match_invoices()          pass
step 03  flag_discrepancy(#4471)   fail  // off by 0.02
step 04  reconcile(adjustment)     pass
reward = 0.74  graded against the closing balance
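The graded trace above can be sketched in a few lines of Python. This is a minimal illustration, not Idler's actual grader: the `StepResult` type, the `grade` function, the ledger values, and the equal weighting of steps are all assumptions made for the example. With equal weights, three passes out of four give 0.75 rather than the 0.74 in the trace, whose weighting is not specified here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class StepResult:
    """One graded step (hypothetical structure, for illustration only)."""
    name: str
    produced: Decimal  # value the model produced at this step
    expected: Decimal  # ground-truth value for this step

    @property
    def passed(self) -> bool:
        # Exact decimal comparison: a 0.02 discrepancy counts as a fail.
        return self.produced == self.expected


def grade(steps: list[StepResult]) -> float:
    """Return the fraction of steps matching ground truth (equal weights assumed)."""
    if not steps:
        raise ValueError("no steps to grade")
    return sum(s.passed for s in steps) / len(steps)


# Made-up values that mirror the shape of the trace, not real data.
steps = [
    StepResult("read_ledger", Decimal("1200.00"), Decimal("1200.00")),
    StepResult("match_invoices", Decimal("37"), Decimal("37")),
    StepResult("flag_discrepancy", Decimal("410.52"), Decimal("410.50")),  # off by 0.02
    StepResult("reconcile", Decimal("1610.50"), Decimal("1610.50")),
]

reward = grade(steps)
print(f"reward = {reward:.2f}")
```

Because the grader is a pure function of the produced and expected values, running it twice on the same steps returns the same number, which is the property that lets different labs compare scores on the same task.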
Drawn from real engineering and operations, not invented. The skill transfers.
Every task tested, every failure mode expert-verified, every accepted answer truly correct.
Stratified by type and graded step by step, so the signal stays dense and informative.
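One way to read "stratified by type" is as a sampling rule: draw a fixed number of tasks from each task type so that no single type dominates an evaluation run. The sketch below assumes that reading. The `stratified_sample` function, the task IDs, and the type labels are hypothetical and are not taken from Idler's pipeline.

```python
import random
from collections import defaultdict


def stratified_sample(tasks, per_type, seed=0):
    """Pick up to `per_type` task IDs from each task type, reproducibly.

    `tasks` is an iterable of (task_id, task_type) pairs.
    """
    rng = random.Random(seed)  # fixed seed so every run draws the same tasks
    by_type = defaultdict(list)
    for task_id, task_type in tasks:
        by_type[task_type].append(task_id)

    sample = {}
    for task_type in sorted(by_type):  # sorted for a stable iteration order
        ids = sorted(by_type[task_type])
        k = min(per_type, len(ids))  # small strata contribute everything they have
        sample[task_type] = sorted(rng.sample(ids, k))
    return sample


# Illustrative task pool; IDs and types are made up.
tasks = [
    ("t1", "coding"), ("t2", "coding"), ("t3", "coding"),
    ("t4", "finance"), ("t5", "finance"),
    ("t6", "science"),
]

picked = stratified_sample(tasks, per_type=2, seed=7)
for task_type, ids in picked.items():
    print(task_type, ids)
```

Seeding the generator keeps the draw identical across runs, so two labs using the same seed would evaluate on the same subset.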
Coding: Real bugs, features, and refactors in live repos.
Finance: Reconciliation, modeling, and long-horizon analysis.
Science: Bio, pharma, and research workflows.
Defense: High-stakes capability and stress-testing.