Portable Evaluation Tasks via the METR Task Standard
METR has published a standard way to define tasks for evaluating the capabilities of AI agents.