evald.ai page 18

OpenAI Evaluation Filter December 18, 2025 12:00

Evaluating chain-of-thought monitorability

OpenAI introduces a new framework and evaluation suite for chain-of-thought monitorability, covering 13 evaluations across 24 environments. Our findings show that monitoring a model’s internal reasoning is far more effective than monitoring outputs alone,...

OpenAI

OpenAI Evaluation Filter December 18, 2025 11:00

Updating our Model Spec with teen protections

OpenAI is updating its Model Spec with new Under-18 Principles that define how ChatGPT should support teens with safe, age-appropriate guidance grounded in developmental science. The update strengthens guardrails, clarifies expected model behavior in...

Safety Evals OpenAI ChatGPT

Hugging Face Evaluation Filter December 17, 2025 13:22

The Open Evaluation Standard: Benchmarking NVIDIA Nemotron 3 Nano with NeMo Evaluator

A blog post by NVIDIA on Hugging Face

NVIDIA

Hacker News LLM Evaluation December 16, 2025 13:28

GitHub - bassrehab/spark-llm-eval: Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration

LLM Evaluation Testing Tools

Google DeepMind Evaluation Filter December 16, 2025 10:14

Gemma Scope 2: Helping the AI Safety Community Deepen Understanding of Complex Language Model Behavior

Announcing Gemma Scope 2, a comprehensive, open suite of interpretability tools for the entire Gemma 3 family to accelerate AI safety research.

Safety Evals

OpenAI Evaluation Filter December 16, 2025 09:00

Evaluating AI’s ability to perform scientific research tasks

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.

Benchmarks Testing Tools OpenAI

OpenAI Evaluation Filter December 16, 2025 08:00

Measuring AI’s capability to accelerate biological research in the wet lab

OpenAI introduces a real-world evaluation framework to measure how AI can accelerate biological research in the wet lab. Using GPT-5 to optimize a molecular cloning protocol, the work explores both the promise and risks of AI-assisted experimentation.

Testing Tools OpenAI

OpenAI Evaluation Filter December 11, 2025 10:00

Advancing science and math with GPT-5.2

GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...

Benchmarks OpenAI

Google News LLM Evaluation December 11, 2025 08:00

GPT-5.2 lands to top Google's Gemini 3 in the AI benchmark game just four weeks after GPT-5.1 - the-decoder.com

Benchmarks Gemini Google

Google DeepMind Evaluation Filter December 11, 2025 00:06

Deepening AI Safety Research with UK AI Security Institute (AISI)

Google DeepMind and the UK AI Security Institute (AISI) strengthen collaboration through a new research partnership, focusing on critical safety research areas like monitoring AI reasoning and evalua…

Safety Evals Google Google DeepMind

Google DeepMind Evaluation Filter December 09, 2025 11:29

FACTS Benchmark Suite: a new way to systematically evaluate LLM factuality

The FACTS Benchmark Suite provides a systematic evaluation of the factuality of Large Language Models (LLMs) across three areas: Parametric, Search, and Multimodal reasoning.

Benchmarks

Google News LLM Evaluation December 09, 2025 08:00

New Benchmark Shows AI Chatbots Are Easily Manipulated - Built In

Benchmarks