OpenAI Evaluation Filter page 3

September 05, 2025 10:00

Why language models hallucinate

OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.

August 27, 2025 10:00

Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests

OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab...

July 30, 2025 00:00

Intercom's three lessons for creating a sustainable AI advantage

Discover how Intercom built a scalable AI platform with 3 key lessons—from evaluations to architecture—to lead the future of customer support.

July 17, 2025 10:00

ChatGPT agent System Card

OpenAI’s agentic model unites research, browser automation, and code tools with safeguards under the Preparedness Framework.

May 12, 2025 10:30

Introducing HealthBench

HealthBench is a new benchmark for AI in healthcare that evaluates models in realistic scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.

April 15, 2025 00:00

Our updated Preparedness Framework

Sharing our updated framework for measuring and protecting against severe harm from frontier AI capabilities.

April 10, 2025 10:00

BrowseComp: a benchmark for browsing agents

April 02, 2025 10:15

PaperBench: Evaluating AI’s Ability to Replicate AI Research

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.

March 31, 2025 15:00

New funding to build towards AGI

Today we’re announcing new funding—$40B at a $300B post-money valuation, which enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT...

February 25, 2025 10:00

Deep research System Card

This report outlines the safety work carried out prior to releasing deep research, including external red teaming, frontier risk evaluations according to our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.

February 18, 2025 10:00

Introducing the SWE-Lancer benchmark

Can frontier LLMs earn $1 million from real-world freelance software engineering?

January 31, 2025 11:00

OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.