evald.ai Sources

METR Blog

Early Results on Monitorability in QA Settings

Vincent Cheng, Thomas Kwa, and Neev Parikh share research on how AI agents can hide secondary task-solving from monitors, finding that harder tasks are more detectable and small models can learn to evade larger monitors.

METR Blog

Forecasting the Impacts of AI R&D Acceleration: Results of a Pilot Study

AI agents are improving rapidly at autonomous software development and machine learning tasks, and, if recent trends hold, may match human researchers at challenging months-long research projects in under a decade. Some economic models predict that...

METR Blog

CoT May Be Highly Informative Despite “Unfaithfulness”

Recent work from Anthropic and others claims that LLMs' chains of thoughts can be “unfaithful”. These papers make an important point: you can't take everything in the CoT at face value. As a result, people often use these results to conclude the CoT is...

Anthropic

METR Blog

How Does Time Horizon Vary Across Domains?

We build on our time-horizon work and analyze 9 benchmarks for scientific reasoning, math, robotics, computer use, and self-driving in terms of time-horizon trends; we observe generally similar rates of improvement to the 7-month doubling time in our...

Benchmarks

Benchmarks