METR Blog

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we did not find significant evidence of a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench, which provides the...

Google DeepMind

Taking a responsible path to AGI

We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

OpenAI

New funding to build towards AGI

Today we’re announcing new funding: $40B at a $300B post-money valuation, which enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT...

METR Blog

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been increasing exponentially over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend...

METR Blog

Response to OSTP on AI Action Plan

Suggested priorities for the Office of Science and Technology Policy as it develops an AI Action Plan.
