Benchmarks page 7

Google News LLM Evaluation December 11, 2025 08:00

GPT-5.2 lands to top Google's Gemini 3 in the AI benchmark game just four weeks after GPT-5.1 - the-decoder.com

Google DeepMind Evaluation Filter December 09, 2025 11:29

FACTS Benchmark Suite: a new way to systematically evaluate LLMs factuality

The FACTS Benchmark Suite systematically evaluates the factuality of Large Language Models (LLMs) across three areas: Parametric, Search, and Multimodal reasoning.

Benchmarks

Google News LLM Evaluation December 09, 2025 08:00

New Benchmark Shows AI Chatbots Are Easily Manipulated - Built In

Benchmarks

Google News LLM Evaluation December 08, 2025 23:32

AI Benchmark for Materials Science Research - anl.gov

Benchmarks

Hacker News LLM Evaluation December 04, 2025 02:30

Evaluation Guidebook - a Hugging Face Space by OpenEvals

This page automatically loads score data from several LLM leaderboards and shows an interactive chart that tracks how top benchmark results have changed. The chart groups benchmarks by category, hi...

Benchmarks

Google News LLM Evaluation December 01, 2025 08:00

Startup Minitap Tops DeepMind’s Mobile AI Benchmark, Raises $4.1 Million Seed Round - Forbes

Benchmarks

Hacker News LLM Evaluation November 24, 2025 14:19

LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Abstract page for arXiv paper 2511.06346: LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Benchmarks LLM Evaluation

Google News LLM Evaluation November 24, 2025 08:00

A new AI benchmark tests whether chatbots protect human well-being - TechCrunch

Benchmarks

Hugging Face Evaluation Filter November 21, 2025 00:00

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Benchmarks

Google News LLM Evaluation November 18, 2025 08:00

Revolutionary Google Gemini 3 Shatters Records with Unprecedented AI Benchmark Scores and Game-Changing Coding App - CryptoRank

Benchmarks

Benchmarks Gemini Google

OpenAI Evaluation Filter November 03, 2025 22:30

Introducing IndQA

OpenAI introduces IndQA, a new benchmark for evaluating AI systems in Indian languages. Built with domain experts, IndQA tests cultural understanding and reasoning across 12 languages and 10 knowledge areas.

Benchmarks

Benchmarks OpenAI

Google News LLM Evaluation November 02, 2025 07:00

Polish emerges as top language in multilingual AI benchmark testing - PPC Land

Benchmarks Testing Tools