OpenAI Evaluation Filter

Introducing SimpleQA

A factuality benchmark called SimpleQA that measures the ability for language models to answer short, fact-seeking questions.

Benchmarks

METR Blog

ERROR: The request could not be satisfied

Red-teaming and security suggestions regarding proposed rule by the Bureau of Industry and Security, “Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters.”

METR Blog

Details about METR's preliminary evaluation of OpenAI o1-preview

We measured the performance of OpenAI's o1-mini and o1-preview models on our autonomy and AI R&D task suites, and found they did not exceed the capabilities of the best existing public model we've evaluated, though we could not confidently upper-bound...

METR Blog

ERROR: The request could not be satisfied

Suggestions for expanded guidance on capability elicitation and robust model safeguards in the U.S. AI Safety Institute’s draft document “Managing Misuse Risk for Dual-Use Foundation Models” (NIST AI 800-1).

Safety Evals

More stories

More stories load automatically as you scroll.