External review from METR of the "Risks from automated R&D" section in Anthropic's February 2026 Risk Report
Topic feed
Safety evaluations, red teaming, preparedness, and model risk testing.
MLCommons introduces Continuous Prompt Stewardship to keep the AILuminate AI safety benchmark fresh and reliable as frontier models evolve.
Google DeepMind releases new findings and an evaluation framework to measure AI's potential for harmful manipulation in areas like finance and health.
MLCommons is developing the AILuminate Culturally-Specific Multimodal Benchmark to close the AI performance and representation gap across APAC cultures, languages, and real-world use cases.
External review from METR of Anthropic's Sabotage Risk Report for Claude Opus 4.6
Luca Righetti shares takeaways on the role of randomized controlled trials in AI safety testing.
Miles Kodama and Michael Chen summarize key provisions from California's SB 53, the EU Code of Practice, and New York's RAISE Act covering frontier AI developers.
OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in...
Announcing Gemma Scope 2, a comprehensive, open suite of interpretability tools for the entire Gemma 3 family to accelerate AI safety research.
Google DeepMind and the UK AI Security Institute (AISI) strengthen collaboration through a new research partnership, focusing on critical safety research areas like monitoring AI reasoning and evalua…
Shared components of AI lab commitments to evaluate and mitigate severe risks.
External review from METR of Anthropic's Summer 2025 Sabotage Risk Report