evald.ai Sources

Why language models hallucinate

OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.

ChatGPT agent System Card

ChatGPT agent System Card: OpenAI’s agentic model unites research, browser automation, and code tools with safeguards under the Preparedness Framework.

Introducing HealthBench

HealthBench is a new benchmark for AI in healthcare that evaluates models in realistic scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.

Our updated Preparedness Framework

Sharing our updated framework for measuring and protecting against severe harm from frontier AI capabilities.

New funding to build towards AGI

Today we’re announcing new funding—$40B at a $300B post-money valuation, which enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT...

Deep research System Card

This report outlines the safety work carried out prior to releasing deep research, including external red teaming, frontier risk evaluations according to our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.

OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.