Example autonomy evaluation protocol
An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.
Community feed
A focused stream of recent stories from the sources curated for this community.
Contribute to METR/public-tasks development by creating an account on GitHub.
Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.
A brief research report outlining quantitative research that could inform the "safety margin" to add to account for further post-training enhancements to agent capability.
METR has published a standard way to define tasks for evaluating the capabilities of AI agents.
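A task standard like this typically asks authors to package each task family as a small, uniform programmatic interface that an evaluation harness can enumerate, prompt, and score. As a rough illustration only, the sketch below shows what such an interface might look like; every name here (`TaskFamily`, `get_tasks`, `get_instructions`, `score`) is an assumption for this example, not METR's actual API.

```python
# Hypothetical sketch of a task-definition interface in the spirit of a
# task standard for agent evaluations. All identifiers are illustrative
# assumptions, not METR's published interface.

class TaskFamily:
    """A family of related tasks, each identified by a short name."""

    standard_version = "0.1"  # illustrative version string

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Map task names to their parameters.
        return {
            "reverse_small": {"text": "abc"},
            "reverse_large": {"text": "evaluation"},
        }

    @staticmethod
    def get_instructions(task: dict) -> str:
        # The prompt shown to the agent for one task.
        return f"Reverse the string {task['text']!r} and submit the result."

    @staticmethod
    def score(task: dict, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission == task["text"][::-1] else 0.0


# Example: an evaluation harness scoring one agent submission.
task = TaskFamily.get_tasks()["reverse_small"]
print(TaskFamily.score(task, "cba"))  # → 1.0
```

The value of a fixed interface like this is that any conforming harness can run any conforming task family without per-task glue code.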
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
A summary of what METR accomplished in 2023 – our first full year of operation.
We’re developing a blueprint for evaluating the risk that a large language model (LLM) could aid someone in creating a biological threat. In an evaluation involving both biology experts and students, we found that GPT-4 provides at most a mild uplift in...