evald.ai Sources

METR Blog

Autonomy Evaluation Resources

A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.

METR Blog

Example autonomy evaluation protocol

An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.

METR Blog

GitHub - METR/public-tasks

METR's public repository of tasks for evaluating autonomous LLM agents.

METR Blog

Guidelines for capability elicitation

Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.

METR Blog

Measuring the impact of post-training enhancements

A brief research report outlining quantitative research that could inform the "safety margin" to add to evaluation results to account for further post-training enhancements to agent capability.

METR Blog

2023 Year In Review

A summary of what METR accomplished in 2023 – our first full year of operation.

METR Blog

Bounty: Diverse hard tasks for LLM agents

METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks to measure the performance of autonomous LLM agents.