An update on our general capability evaluations
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”
METR is hiring ML engineers and researchers.
Emma moves from President to Executive Director, and Beth moves to Head of Research.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.
An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.
METR/public-tasks: a public GitHub repository of evaluation tasks; contributions are welcome.
Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.
A brief research report outlining quantitative research that could inform the "safety margin" needed to account for further post-training enhancements to agent capability.
METR has published a standard way to define tasks for evaluating the capabilities of AI agents.
A summary of what METR accomplished in 2023 – our first full year of operation.
METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks to measure the performance of autonomous LLM agents.