evald.ai

  1. Protecting People from Harmful Manipulation
  2. Measuring progress toward AGI: A cognitive framework
  3. Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior

  1. Adding Benchmaxxer Repellant to the Open ASR Leaderboard
  2. QIMMA قِمّة ⛰: A Quality-First Arabic LLM Leaderboard
  3. Community Evals: Because we're done trusting black-box leaderboards over the community

  1. Measuring the Self-Reported Impact of Early-2026 AI on Technical Worker Productivity
  2. Review of the "Risks from automated R&D" section in the Anthropic Risk Report (February 2026)
  3. Task Substitution and Uplift

  1. GPT-OSS 20B: A Sparse MoE Pretraining Benchmark for MLPerf Training v6.0 - MLCommons
  2. DeepSeek-V3: A Large-Scale MoE Pretraining Benchmark for MLPerf Training v6.0 - MLCommons
  3. Fresh Benchmarks, Reliable Scores: Introducing Continuous Prompt Stewardship for AI Risk Evaluation - MLCommons

  1. How enterprises are scaling AI
  2. Creating images with ChatGPT
  3. Using skills
