evald.ai Sources

OpenAI Evaluation Filter

Measuring AI’s capability to accelerate biological research in the wet lab

OpenAI introduces a real-world evaluation framework to measure how AI can accelerate biological research in the wet lab. Using GPT-5 to optimize a molecular cloning protocol, the work explores both the promise and risks of AI-assisted experimentation.

Advancing science and math with GPT-5.2

GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...

OpenAI to acquire Neptune

OpenAI is acquiring Neptune to deepen visibility into model behavior and strengthen the tools researchers use to track experiments and monitor training.

Introducing IndQA

OpenAI introduces IndQA, a new benchmark for evaluating AI systems in Indian languages. Built with domain experts, IndQA tests cultural understanding and reasoning across 12 languages and 10 knowledge areas.

gpt-oss-safeguard technical report

gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from a provided policy in order to label content under that policy. In this report, we describe...

Addendum to GPT-5 System Card: Sensitive conversations

This system card details GPT-5’s improvements in handling sensitive conversations, including new benchmarks for emotional reliance, mental health, and jailbreak resistance.

Detecting and reducing scheming in AI models

Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce...