Safety Evals

Google News AI Red Teaming May 11, 2026 03:58

Google's top scientist to European Commission: In less than 2 hours, our Red team 'hacked' the system yo - The Times of India

Safety Evals Google

METR Blog May 08, 2026 07:00

Review of the "Risks from automated R&D" section in the Anthropic Risk Report (February 2026)

External review from METR of the "Risks from automated R&D" section in Anthropic's February 2026 Risk Report

Anthropic Safety Evals

Google News AI Red Teaming May 05, 2026 18:35

Youth AI Safety Independent Testing Regime Launch - Hot Community Stocks - newser.com

Safety Evals Testing Tools

MLCommons Evaluation Filter April 20, 2026 22:10

Fresh Benchmarks, Reliable Scores: Introducing Continuous Prompt Stewardship for AI Risk Evaluation - MLCommons

MLCommons introduces Continuous Prompt Stewardship to keep the AILuminate AI safety benchmark fresh and reliable as frontier models evolve.

Safety Evals Benchmarks

Google DeepMind Evaluation Filter March 25, 2026 16:46

Protecting People from Harmful Manipulation

Google DeepMind releases new findings and an evaluation framework to measure AI's potential for harmful manipulation in areas like finance and health, with the goal of enhancing AI safety.

Safety Evals Testing Tools Google Google DeepMind

MLCommons Evaluation Filter March 13, 2026 16:57

Global Standards, Local Ground Truths: Piloting Multilingual, Multimodal AI Safety Understanding in APAC - MLCommons

MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.

Safety Evals Benchmarks

METR Blog March 12, 2026 07:00

Review of the Anthropic Sabotage Risk Report: Claude Opus 4.6

External review from METR of Anthropic's Sabotage Risk Report for Claude Opus 4.6

Anthropic Safety Evals Claude Claude Opus

Mitchell Bryson AI Reliability Articles March 03, 2026 00:00

The Pentagon values auction: AI safety gets its market test - Mitchell Bryson

OpenAI amends its Pentagon deal after Altman admits it looked 'opportunistic and sloppy', while Claude surges to number one on the App Store and hundreds of employees publicly back Anthropic's stance.

Anthropic Safety Evals Claude OpenAI

Mitchell Bryson AI Reliability Articles February 25, 2026 00:00

Pentagon Escalates Dispute with Anthropic, Threatens Defense Production Act - Mitchell Bryson

Defense Secretary Pete Hegseth gives Anthropic until Friday to provide military access to Claude or face being declared a supply chain risk or forced compliance under the Defense Production Act.

Anthropic Safety Evals Claude

METR Blog February 19, 2026 08:00

Five lessons from having helped run an AI-Biology RCT

Luca Righetti shares takeaways on the role of randomized controlled trials in AI safety testing.

Safety Evals Testing Tools

METR Blog January 29, 2026 22:12

Frontier AI safety regulations: A reference for lab staff

Miles Kodama and Michael Chen summarize the key provisions that apply to frontier AI developers under California's SB 53, the EU Code of Practice, and New York's RAISE Act.

Safety Evals

OpenAI Evaluation Filter December 18, 2025 11:00

Updating our Model Spec with teen protections

OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in...

Safety Evals OpenAI ChatGPT