METR Blog

Early Results on Monitorability in QA Settings

Vincent Cheng, Thomas Kwa, and Neev Parikh share research on how AI agents can hide secondary task-solving from monitors, finding that harder secondary tasks are more detectable and that small models can learn to evade larger monitors.

OpenAI

Detecting and reducing scheming in AI models

Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce scheming.

OpenAI

Why language models hallucinate

OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.
