Benchmarks page 2

Google News AI Safety Evaluation + 1 source May 09, 2026 08:38

Claude Mythos Shows 50% Time Horizon Of 16+ Hours On METR Benchmark - OfficeChai

Benchmarks

Claude Benchmarks Mythos

Google News SWE Bench May 07, 2026 16:02

Scale Labs debuts new Refactoring Leaderboard for AI - TestingCatalog AI News

Benchmarks

Google News MLPerf May 07, 2026 15:00

Cisco and AMD Benchmark Scale-out AI Fabric Performance - Let's Data Science

Benchmarks

MLCommons Evaluation Filter May 07, 2026 13:23

GPT-OSS 20B: A Sparse MoE Pretraining Benchmark for MLPerf Training v6.0 - MLCommons

MLPerf Training v6.0 introduces GPT-OSS 20B, a new sparse Mixture-of-Experts (MoE) pretraining benchmark designed for accessibility on single 8-GPU nodes.

Benchmarks

Google News AI Benchmarks + 1 source May 07, 2026 12:27

DeepSeek V4 analysis: What's the point of topping the AI leaderboard if nobody can afford you? - news.cgtn.com

Benchmarks

Google News Eval Frameworks May 06, 2026 09:19

Claude Opus 4.7, Gemini 3.1 Pro, and Others Score 0% on New SWE Benchmark - Analytics India Magazine

Benchmarks

Claude Claude Opus Benchmarks Gemini

Hugging Face Evaluation Filter May 06, 2026 00:00

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Benchmarks

MLCommons Evaluation Filter May 05, 2026 13:37

DeepSeek-V3: A Large-Scale MoE Pretraining Benchmark for MLPerf Training v6.0 - MLCommons

MLPerf Training v6.0 introduces a large-scale pretraining benchmark built on DeepSeek-V3, bringing Mixture-of-Experts (MoE) evaluation to the suite.

Benchmarks

Google News LLM Evaluation April 30, 2026 09:55

Alphabet just became the Magnificent 7's new AI benchmark - Opening Bell Daily

Benchmarks

Hugging Face Evaluation Filter April 21, 2026 10:09

QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard

A Blog post by Technology Innovation Institute on Hugging Face

Benchmarks LLM Evaluation

MLCommons Evaluation Filter April 20, 2026 22:10

Fresh Benchmarks, Reliable Scores: Introducing Continuous Prompt Stewardship for AI Risk Evaluation - MLCommons

MLCommons introduces Continuous Prompt Stewardship to keep the AILuminate AI safety benchmark fresh and reliable as frontier models evolve.

Benchmarks Safety Evals

Google News LLM Evaluation April 16, 2026 15:21

EQS AI Benchmark Report Vol. 2 - EQS Group

Benchmarks