METR Blog

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we did not find significant evidence of a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench, which provides the...

Google DeepMind

Taking a responsible path to AGI

We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

OpenAI

New funding to build towards AGI

Today we’re announcing new funding: $40B at a $300B post-money valuation, which enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT...

METR Blog

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been increasing exponentially over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend...

METR Blog

Response to OSTP on AI Action Plan

Suggested priorities for the Office of Science and Technology Policy as it develops an AI Action Plan.
