An update on our general capability evaluations
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”
METR is hiring ML engineers and researchers.
Emma moves from President to Executive Director, and Beth moves to Head of Research.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.
An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.
METR/public-tasks: a public GitHub repository of evaluation tasks; contributions are welcome.
Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.
A brief research report outlining quantitative research that could inform the "safety margin" needed to account for further post-training enhancements to agent capability.
METR has published a standard way to define tasks for evaluating the capabilities of AI agents.
A summary of what METR accomplished in 2023 – our first full year of operation.
METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks to measure the performance of autonomous LLM agents.