evald.ai Sources

METR Blog

Recent Frontier Models Are Reward Hacking

In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually...

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we did not find significant evidence of a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench which provides the...

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend...
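The extrapolation described above follows directly from the stated doubling time: under exponential growth, the task-length horizon multiplies by 2^(months / 7) over any span of months. A minimal sketch (the function name and starting value are illustrative, not from the post):

```python
import math

def extrapolate_task_length(current_minutes: float, months_ahead: float,
                            doubling_time_months: float = 7.0) -> float:
    """Project the task-length horizon forward under exponential growth
    with the given doubling time (the post reports roughly 7 months)."""
    return current_minutes * 2 ** (months_ahead / doubling_time_months)

# One doubling time ahead, the horizon doubles: 60 min -> 120 min.
assert math.isclose(extrapolate_task_length(60, 7), 120.0)

# Two years ahead (24 months): 60 * 2**(24/7), roughly 646 minutes.
projected = extrapolate_task_length(60, 24)
```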

Response to OSTP on AI Action Plan

Suggested priorities for the Office of Science and Technology Policy as it develops an AI Action Plan.

Details about METR's preliminary evaluation of DeepSeek-R1

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than...

METR’s GPT-4.5 pre-deployment evaluations

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

Measuring Automated Kernel Engineering

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, we found that the best model achieves an average speedup of 1.8x on KernelBench.
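Per-kernel speedups are ratios of baseline runtime to generated-kernel runtime, so a geometric mean is the natural way to average them. A minimal sketch of that aggregation (an assumption for illustration; the summary does not specify which mean was used):

```python
import math

def mean_speedup(baseline_times, kernel_times):
    """Geometric mean of per-task speedups (baseline / generated kernel).

    Speedups are multiplicative ratios, so we average their logs rather
    than taking a plain arithmetic mean.
    """
    ratios = [b / k for b, k in zip(baseline_times, kernel_times)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Per-kernel speedups of 1x, 2x, and 3x give a geometric mean of
# (1*2*3) ** (1/3), about 1.82x.
avg = mean_speedup([2.0, 4.0, 6.0], [2.0, 2.0, 2.0])
```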