We spent 2 hours working in the future
Thomas Kwa describes a tabletop exercise where METR researchers simulated having access to ~200-hour time horizon AIs.
Community feed
A focused stream of recent stories from the sources curated for this community.
Arena Leaderboard: The Unbreakable Ranking System That’s Revolutionizing AI Model Evaluation - CryptoRank
Google DeepMind proposes a cognitive framework to evaluate AGI and launches a Kaggle hackathon to build capability benchmarks
AI benchmark numbers are meaningless — here’s what to look for instead - MakeUseOf
MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.
MLPerf Inference v6.0 upgrades its edge object detection benchmark from RetinaNet to YOLOv11, bringing modern real-time detection to standardized AI hardware evaluation
Qwen3.5-9B tops every AI benchmark right now, but that's not how you should pick a model - XDA
METR's external review of Anthropic's Sabotage Risk Report for Claude Opus 4.6
MiniMax M2.5 Sparks AI Benchmark Fraud Debate - AI CERTs
MLCommons introduces the new Text-to-Video benchmark in MLPerf Inference v6.0, based on the Wan2.2-T2V-A14B-Diffusers model and validated using the VBench framework. Learn about the key architectural decisions, including the adoption of the SingleStream...
How Anthropic’s Claude Opus 4.6 Broke Its Own AI Benchmark - WinBuzzer
What is Model Evaluation? - IBM