Community Evals: Because we're done trusting black-box leaderboards over the community
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Community feed
A focused stream of recent stories from the sources curated for this community. Latest: Community Evals: Because we're done trusting black-box leaderboards over the community, Google adopts Werewolf and Poker in AI benchmark 'Game Arena' - GIGAZINE, and Frontier AI safety regulations: A reference for lab staff.
Google adopts Werewolf and Poker in AI benchmark 'Game Arena' - GIGAZINE
Miles Kodama and Michael Chen summarize key provisions from California's SB 53, the EU Code of Practice, and New York's RAISE Act covering frontier AI developers.
New AI benchmark reveals UK agencies are ‘all in’ – but only 2% feel prepared - TheBusinessDesk.com
We’re releasing a new version of our time horizon estimates (TH1.1), using more tasks and a new eval infrastructure.
A Blog post by Technology Innovation Institute on Hugging Face
We show preliminary results on a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring.
Thomas Kwa responds to some misinterpretations of our time horizon work, and explains limitations and the core finding.
Large Language Model Evaluation in '26: 10+ Metrics & Methods - AIMultiple
A Blog post by IBM Research on Hugging Face
Amazon Bedrock Model Evaluation Tool Demo - Amazon Web Services (AWS)
OpenAI plans to test advertising in the U.S. for ChatGPT’s free and Go tiers to expand affordable access to AI worldwide, while protecting privacy, trust, and answer quality.