Benchmarks page 8

METR Blog August 13, 2025 07:00

Research Update: Algorithmic vs. Holistic Evaluation

Many AI benchmarks use algorithmic scoring to evaluate how well AI systems perform on some set of tasks. However, AI systems often produce code that scores well but isn't production-ready due to issues with test coverage, formatting, and code quality. This...

Benchmarks LLM Evaluation

Google News LLM Evaluation August 06, 2025 07:00

Is your AI benchmark lying to you? - Nature

Benchmarks

Hugging Face Evaluation Filter August 01, 2025 14:25

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

A Blog post by Technology Innovation Institute on Hugging Face

Benchmarks

Google News LLM Evaluation August 01, 2025 07:00

MLPerf Client 1.0 AI benchmark released — new testing toolkit sports a GUI, covers more models and tasks, and supports more hardware acceleration paths - Tom's Hardware

Benchmarks Testing Tools

Google News LLM Evaluation July 24, 2025 07:00

This new AI benchmark tests how much AI sucks up to you - fanaticalfuturist.com

Benchmarks

Google News LLM Evaluation July 23, 2025 07:00

German Gov-Backed AI Benchmark Tracks Large Language Models in 200 Languages - Slator

Benchmarks

METR Blog July 14, 2025 07:00

How Does Time Horizon Vary Across Domains?

We build on our time-horizon work and analyze 9 benchmarks for scientific reasoning, math, robotics, computer use, and self-driving in terms of time-horizon trends; we observe generally similar rates of improvement to the 7-month doubling time in our...

Benchmarks

Google News LLM Evaluation July 10, 2025 07:00

Elon Musk’s xAI sets AI benchmark records with new reasoning-optimized Grok 4 model - SiliconANGLE

Benchmarks xAI

Google News LLM Evaluation July 10, 2025 07:00

Former Intel CEO’s New AI Benchmark Focuses on Human Flourishing - The New Stack

Benchmarks

Google News LLM Evaluation May 31, 2025 07:57

Topic: Artificial intelligence (AI) benchmark and training - Statista

Benchmarks

OpenAI Evaluation Filter May 12, 2025 10:30

Introducing HealthBench

HealthBench is a new evaluation benchmark for AI in healthcare which evaluates models in realistic scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.

Benchmarks

OpenAI Evaluation Filter April 10, 2025 10:00

BrowseComp: a benchmark for browsing agents

Benchmarks