BrowseComp: a benchmark for browsing agents
PaperBench: a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.
SWE-Lancer: can frontier LLMs earn $1 million from real-world freelance software engineering?
What Makes a Good AI Benchmark? (Stanford HAI)
RE-Bench: a benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks, with data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s…
SimpleQA: a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
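To make the shape of a short-answer factuality eval like SimpleQA concrete, here is a minimal sketch in Python. Every name in it (the toy questions, `ask_model`, the normalized exact-match grader) is a hypothetical illustration, not the actual SimpleQA harness; the real benchmark grades free-form model responses with a model-based grader rather than string matching.

```python
# Minimal sketch of a short-answer factuality eval loop (SimpleQA-style).
# All names here are illustrative placeholders, not the official harness API.

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'The transistor (1947).' matches '1947'."""
    kept = (ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return " ".join("".join(kept).split())

def grade(prediction: str, reference: str) -> bool:
    """Lenient exact match; real benchmarks often use a model-based grader instead."""
    return normalize(reference) in normalize(prediction)

def run_eval(questions: list[tuple[str, str]], ask_model) -> float:
    """Return accuracy of `ask_model` over (question, reference_answer) pairs."""
    correct = sum(grade(ask_model(q), a) for q, a in questions)
    return correct / len(questions)

if __name__ == "__main__":
    # Toy data standing in for real benchmark questions.
    toy = [("In what year was the transistor invented?", "1947")]
    print(run_eval(toy, lambda q: "It was invented in 1947."))  # -> 1.0
```

The design choice that matters for benchmarks in this family is that answers are short and unambiguous, so grading reduces to a simple correct/incorrect decision per question and accuracy is a meaningful single number.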