METR Blog

Recent Frontier Models Are Reward Hacking

In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually...

OpenAI

Introducing HealthBench

HealthBench is a new benchmark that evaluates AI models in realistic healthcare scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.

Benchmarks

METR Blog

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we did not find significant evidence of a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench, which provides the...

Google DeepMind

Taking a responsible path to AGI

We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

Safety Evals
