OpenAI page 2

OpenAI Evaluation Filter December 18, 2025 12:00

Evaluating chain-of-thought monitorability

OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findings show that monitoring a model’s internal reasoning is far more effective than monitoring outputs alone,...

OpenAI

OpenAI Evaluation Filter December 18, 2025 11:00

Updating our Model Spec with teen protections

OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in...

Safety Evals

Safety Evals OpenAI ChatGPT

OpenAI Evaluation Filter December 16, 2025 09:00

Evaluating AI’s ability to perform scientific research tasks

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.

Benchmarks Testing Tools

Benchmarks Testing Tools OpenAI

OpenAI Evaluation Filter December 16, 2025 08:00

Measuring AI’s capability to accelerate biological research in the wet lab

OpenAI introduces a real-world evaluation framework to measure how AI can accelerate biological research in the wet lab. Using GPT-5 to optimize a molecular cloning protocol, the work explores both the promise and risks of AI-assisted experimentation.

Testing Tools

Testing Tools OpenAI

OpenAI Evaluation Filter December 11, 2025 10:00

Advancing science and math with GPT-5.2

GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...

Benchmarks

Benchmarks OpenAI

OpenAI Evaluation Filter December 03, 2025 10:00

OpenAI to acquire Neptune

OpenAI is acquiring Neptune to deepen visibility into model behavior and strengthen the tools researchers use to track experiments and monitor training.

OpenAI

METR Blog November 19, 2025 08:00

Details about METR's evaluation of OpenAI GPT-5.1-Codex-Max

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely.

OpenAI

OpenAI Evaluation Filter November 03, 2025 22:30

Introducing IndQA

OpenAI introduces IndQA, a new benchmark for evaluating AI systems in Indian languages. Built with domain experts, IndQA tests cultural understanding and reasoning across 12 languages and 10 knowledge areas.

Benchmarks

Benchmarks OpenAI

METR Blog October 23, 2025 07:00

Summary of our gpt-oss methodology review

Details on external recommendations from METR for gpt-oss Preparedness experiments and follow-up from OpenAI.

Safety Evals

Safety Evals OpenAI

OpenAI Evaluation Filter September 29, 2025 13:30

Improving support with every interaction at OpenAI

Learn how OpenAI uses AI to enhance support, cutting response times, improving quality, and scaling to meet hypergrowth.

LLM Evaluation

LLM Evaluation OpenAI

OpenAI Evaluation Filter September 25, 2025 09:00

Measuring the performance of our models on real-world tasks

OpenAI introduces GDPval, a new evaluation that measures model performance on real-world economically valuable tasks across 44 occupations.

OpenAI

OpenAI Evaluation Filter September 17, 2025 00:00

Detecting and reducing scheming in AI models

Apollo Research and OpenAI developed evaluations for hidden misalignment (“scheming”) and found behaviors consistent with scheming in controlled tests across frontier models. The team shared concrete examples and stress tests of an early method to reduce...

OpenAI