evald.ai Sources

METR Blog

Task Substitution and Uplift

We distinguish three measures of AI uplift (on old tasks, on new tasks, and in value) and show that task substitution can cause these to diverge substantially.

METR Blog

Evidence on AI R&D Progress from NanoGPT

Classifying human and agent contributions to the NanoGPT speedrun, and what publicly tracked challenges can tell us about AI R&D acceleration.

METR Blog

Fine-tuning experiments on CoT controllability

We find that a small amount of fine-tuning on instruction following in the chain of thought (CoT) generalizes to meaningful increases in CoT controllability on an out-of-distribution set of tasks. We fine-tune four reasoning models on small datasets of instruction-following...

METR Blog

We spent 2 hours working in the future

Thomas Kwa describes a tabletop exercise where METR researchers simulated having access to ~200-hour time horizon AIs.

METR Blog

Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more...