evald.ai Sources

METR Blog

Details about METR's preliminary evaluation of DeepSeek-V3

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data...

METR Blog

ERROR: The request could not be satisfied

Red-teaming and security suggestions regarding proposed rule by the Bureau of Industry and Security, “Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters.”

METR Blog

Details about METR's preliminary evaluation of OpenAI o1-preview

We measured the performance of OpenAI's o1-mini and o1-preview models on our autonomy and AI R&D task suites, and found they did not exceed the capabilities of the best existing public model we've evaluated, though we could not confidently upper-bound...

METR Blog

ERROR: The request could not be satisfied

Suggestions for expanded guidance on capability elicitation and robust model safeguards in the U.S. AI Safety Institute’s draft document “Managing Misuse Risk for Dual-Use Foundation Models” (NIST AI 800-1).

METR Blog

Vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research. Vivaria is a web application with which users can interact using a web UI and a command-line interface.