Benchmarks page 6

Google News LLM Evaluation February 04, 2026 08:00

NIST Seeks Public Input on Draft Best Practices for Automated AI Benchmark Testing - ExecutiveGov

Benchmarks Testing Tools

Google News LLM Evaluation February 03, 2026 08:00

Google adopts Werewolf and Poker in AI benchmark 'Game Arena' - GIGAZINE

Benchmarks Google

Google News LLM Evaluation January 29, 2026 08:00

New AI benchmark reveals UK agencies are ‘all in’ – but only 2% feel prepared - TheBusinessDesk.com

Benchmarks

Hugging Face Evaluation Filter January 21, 2026 06:25

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

A blog post by IBM Research on Hugging Face

Benchmarks

Google News LLM Evaluation January 12, 2026 08:00

Spirit AI Open-Sources Spirit v1.5, Tops Global Embodied AI Benchmark - Pandaily

Benchmarks

OpenAI Evaluation Filter December 16, 2025 09:00

Evaluating AI’s ability to perform scientific research tasks

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.

Benchmarks Testing Tools OpenAI

OpenAI Evaluation Filter December 11, 2025 10:00

Advancing science and math with GPT-5.2

GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...

Benchmarks OpenAI

Google News LLM Evaluation December 11, 2025 08:00

GPT-5.2 lands to top Google's Gemini 3 in the AI benchmark game just four weeks after GPT-5.1 - the-decoder.com

Benchmarks Gemini Google

Google DeepMind Evaluation Filter December 09, 2025 11:29

FACTS Benchmark Suite: a new way to systematically evaluate LLMs factuality

The FACTS Benchmark Suite provides a systematic evaluation of the factuality of Large Language Models (LLMs) across three areas: Parametric, Search, and Multimodal reasoning.

Benchmarks

Google News LLM Evaluation December 09, 2025 08:00