Benchmarks page 5

Google News LLM Evaluation March 10, 2026 17:22

MiniMax M2.5 Sparks AI Benchmark Fraud Debate - AI CERTs

MLCommons Evaluation Filter March 10, 2026 14:17

Bringing Text-to-Video to MLPerf Inference v6.0 - MLCommons

MLCommons introduces the new Text-to-Video benchmark in MLPerf Inference v6.0, based on the Wan2.2-T2V-A14B-Diffusers model and validated using the VBench framework. Learn about the key architectural decisions, including the adoption of the SingleStream...

Benchmarks

Google News LLM Evaluation March 10, 2026 07:00

How Anthropic’s Claude Opus 4.6 Broke Its Own AI Benchmark - WinBuzzer

Anthropic Claude Claude Opus Benchmarks

METR Blog March 10, 2026 07:00

Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more...

Benchmarks

Google News LLM Evaluation March 09, 2026 07:00

Researchers build Humanity’s Last Exam AI benchmark - EdTech Innovation Hub

Benchmarks

Google News LLM Evaluation March 05, 2026 08:00

NVIDIA Blackwell Smashes Finance AI Benchmark With 3.2x Speed Gains - MEXC

Benchmarks

Google News LLM Evaluation February 28, 2026 08:00

The Bullshit Index: Why the AI Benchmark You've Never Heard Of is the One That Actually Matters - CXOToday.com

Benchmarks

OpenAI Evaluation Filter February 26, 2026 10:00

Pacific Northwest National Laboratory and OpenAI partner to accelerate federal permitting

OpenAI and Pacific Northwest National Laboratory introduce DraftNEPABench, a new benchmark evaluating how AI coding agents can accelerate federal permitting—showing potential to reduce NEPA drafting time by up to 15% and modernize infrastructure reviews.

Benchmarks OpenAI

Google News LLM Evaluation February 25, 2026 08:00

NIST Publishes New Guidance to Strengthen AI Benchmark Evaluations - ExecutiveGov

Benchmarks

Google News LLM Evaluation February 18, 2026 19:42

OpenAI Unveils AI Benchmark Tool to Enhance Blockchain Security - thedefiant.io

Benchmarks OpenAI

OpenAI Evaluation Filter February 18, 2026 00:00

Introducing EVMbench

OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.

Benchmarks OpenAI

Google News LLM Evaluation February 16, 2026 08:00

Mathematicians contribute to AI benchmark - The University of Manchester

Benchmarks