Benchmarks

MLCommons Evaluation Filter March 19, 2026 18:59

Standardizing Generative AI Service Evaluation: An API-Centric Benchmarking Approach - MLCommons

MLPerf® Endpoints brings API-native benchmarking, Pareto curve visualizations, and rolling submissions to generative AI infrastructure evaluation.

Benchmarks

Google News LLM Evaluation March 18, 2026 07:00

Arena Leaderboard: The Unbreakable Ranking System That’s Revolutionizing AI Model Evaluation - CryptoRank

Benchmarks LLM Evaluation

Google DeepMind Evaluation Filter March 17, 2026 16:03

Measuring progress toward AGI: A cognitive framework

Google DeepMind proposes a cognitive framework to evaluate AGI and launches a Kaggle hackathon to build capability benchmarks.

Benchmarks

Google News LLM Evaluation March 15, 2026 07:00

AI benchmark numbers are meaningless — here’s what to look for instead - MakeUseOf

Benchmarks

MLCommons Evaluation Filter March 13, 2026 16:57

Global Standards, Local Ground Truths: Piloting Multilingual, Multimodal AI Safety Understanding in APAC - MLCommons

MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.

Benchmarks Safety Evals

MLCommons Evaluation Filter March 12, 2026 15:21

YOLO for the MLPerf Inference v6.0 Edge Suite - MLCommons

MLPerf Inference v6.0 upgrades its edge object detection benchmark from RetinaNet to YOLOv11, bringing modern real-time detection to standardized AI hardware evaluation.

Benchmarks

Google News LLM Evaluation March 12, 2026 07:00

Qwen3.5-9B tops every AI benchmark right now, but that's not how you should pick a model - XDA

Benchmarks

Google News LLM Evaluation March 10, 2026 17:22

MiniMax M2.5 Sparks AI Benchmark Fraud Debate - AI CERTs

Benchmarks

MLCommons Evaluation Filter March 10, 2026 14:17

Bringing Text-to-Video to MLPerf Inference v6.0 - MLCommons

MLCommons introduces the new Text-to-Video benchmark in MLPerf Inference v6.0, based on the Wan2.2-T2V-A14B-Diffusers model and validated using the VBench framework. Learn about the key architectural decisions, including the adoption of the SingleStream...

Benchmarks

Google News LLM Evaluation March 10, 2026 07:00

How Anthropic’s Claude Opus 4.6 Broke Its Own AI Benchmark - WinBuzzer

Benchmarks

METR Blog March 10, 2026 07:00

Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more...

Benchmarks

Google News LLM Evaluation March 09, 2026 07:00

Researchers build Humanity’s Last Exam AI benchmark | ETIH EdTech News - EdTech Innovation Hub

Benchmarks