Claude

METR Blog February 17, 2026 08:00

Analyzing coding agent transcripts to upper bound productivity gains from AI agents

Amy Deng investigates whether coding agent transcripts could serve as an alternative for estimating AI productivity uplift, using 5305 Claude Code transcripts from METR technical staff.

Claude, Claude Code

METR Blog February 13, 2026 08:00

Measuring Time Horizon using Claude Code and Codex

Nikola Jurkovic describes our measurements of time horizon using Claude Code and Codex scaffolds.

Claude, Claude Code

METR Blog August 22, 2025 07:00

Claude, GPT, and Gemini All Struggle to Evade Monitors

Vincent Cheng and Thomas Kwa replicate a Google DeepMind paper on chain-of-thought monitoring, showing evidence that monitoring works on other companies' models.

Claude, Gemini, Google, Google DeepMind

METR Blog April 04, 2025 07:00

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench which provides the...

Claude

METR Blog March 05, 2025 08:00

Details about METR's preliminary evaluation of DeepSeek-R1

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than...

Claude

METR Blog February 12, 2025 08:00

Details about METR's preliminary evaluation of DeepSeek-V3

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data...

Claude

METR Blog January 31, 2025 08:00

An update on our preliminary evaluations of Claude 3.5 Sonnet and o1

Preliminary evaluations of Claude 3.5 Sonnet (New) and o1, as well as some discussion of challenges in making capability-based safety arguments for AI models.

Claude

METR Blog November 22, 2024 08:00

Evaluating frontier AI R&D capabilities of language model agents against human experts

We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s...

Anthropic, Claude, Benchmarks, OpenAI

METR Blog August 06, 2024 17:00

An update on our general capability evaluations

More tasks, human baselines, and preliminary results for GPT-4 and Claude.

Claude

METR Blog March 17, 2023 15:22

Update on ARC's recent eval efforts

More information about ARC's evaluations of GPT-4 and Claude.

Claude, LLM Evaluation