Evaluating AI’s ability to perform scientific research tasks
OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.
OpenAI introduces a real-world evaluation framework to measure how AI can accelerate biological research in the wet lab. Using GPT-5 to optimize a molecular cloning protocol, the work explores both the promise and risks of AI-assisted experimentation.
GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...
OpenAI is acquiring Neptune to deepen visibility into model behavior and strengthen the tools researchers use to track experiments and monitor training.
Global manufacturer Scania is scaling AI with ChatGPT Enterprise. With team-based onboarding and strong guardrails, AI is boosting productivity, quality, and innovation.
This GPT-5 system card addendum provides updated safety metrics for GPT-5.1 Instant and Thinking, including new evaluations for mental health and emotional reliance.
OpenAI introduces IndQA, a new benchmark for evaluating AI systems in Indian languages. Built with domain experts, IndQA tests cultural understanding and reasoning across 12 languages and 10 knowledge areas.
gpt-oss-safeguard-120b and gpt-oss-safeguard-20b are two open-weight reasoning models post-trained from the gpt-oss models and trained to reason from a provided policy in order to label content under that policy. In this report, we describe...
This system card details GPT-5’s improvements in handling sensitive conversations, including new benchmarks for emotional reliance, mental health, and jailbreak resistance.
Learn how OpenAI uses AI to enhance support by cutting response times, improving quality, and scaling to meet hypergrowth.
OpenAI introduces GDPval, a new evaluation that measures model performance on real-world economically valuable tasks across 44 occupations.
Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce...