We spent 2 hours working in the future
Thomas Kwa describes a tabletop exercise where METR researchers simulated having access to ~200-hour time horizon AIs.
Community feed
A focused stream of recent stories from the sources curated for this community.
Arena Leaderboard: The Unbreakable Ranking System That’s Revolutionizing AI Model Evaluation - CryptoRank
Google DeepMind proposes a cognitive framework to evaluate AGI and launches a Kaggle hackathon to build capability benchmarks
AI benchmark numbers are meaningless — here’s what to look for instead - MakeUseOf
MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.
MLPerf Inference v6.0 upgrades its edge object detection benchmark from RetinaNet to YOLOv11, bringing modern real-time detection to standardized AI hardware evaluation
Qwen3.5-9B tops every AI benchmark right now, but that's not how you should pick a model - XDA
METR's external review of Anthropic's Sabotage Risk Report for Claude Opus 4.6
MiniMax M2.5 Sparks AI Benchmark Fraud Debate - AI CERTs
MLCommons introduces the new Text-to-Video benchmark in MLPerf Inference v6.0, based on the Wan2.2-T2V-A14B-Diffusers model and validated using the VBench framework. Learn about the key architectural decisions, including the adoption of the SingleStream...
How Anthropic’s Claude Opus 4.6 Broke Its Own AI Benchmark - WinBuzzer
What is Model Evaluation? - IBM