Benchmarks page 6

Google News LLM Evaluation February 12, 2026 08:00

Tether EVO Scores Top 5 In Global AI Benchmark for Brain-to-Text AI Challenge - Cryptonews.net

Benchmarks

Google News LLM Evaluation February 12, 2026 08:00

1Password open sources a benchmark to stop AI agents from leaking credentials - Help Net Security

Benchmarks

Hacker News LLM Evaluation February 10, 2026 04:43

Dharma_Code/paper/vocab_priming_confound.pdf at main · Palmerschallon/Dharma_Code

Polyglot ontological activations for LLM systems. 68 terms from 20+ traditions mapped to computational patterns, plus 10 algorithms native to the ontology that have no equivalents in standard CS. Includes benchmark suite and a documented evaluation...

Benchmarks

Google News LLM Evaluation February 09, 2026 08:00

University of Manchester academics contribute to the toughest AI benchmark - The University of Manchester

Benchmarks

Google News LLM Evaluation February 04, 2026 20:24

Joel Becker: Reconciling Impressive AI Benchmark Performance with Limited Developer Productivity Impacts - Stanford Digital Economy Lab

Benchmarks

Google News LLM Evaluation February 04, 2026 08:00

NIST Seeks Public Input on Draft Best Practices for Automated AI Benchmark Testing - ExecutiveGov

Benchmarks Testing Tools

Google News LLM Evaluation February 03, 2026 08:00

Google adopts Werewolf and Poker in AI benchmark 'Game Arena' - GIGAZINE

Benchmarks Google

Google News LLM Evaluation January 29, 2026 08:00

New AI benchmark reveals UK agencies are ‘all in’ – but only 2% feel prepared - TheBusinessDesk.com

Benchmarks

Hugging Face Evaluation Filter January 21, 2026 06:25

AssetOpsBench: Bridging the Gap Between AI Agent Benchmarks and Industrial Reality

A Blog post by IBM Research on Hugging Face

Benchmarks

Google News LLM Evaluation January 12, 2026 08:00

Spirit AI Open-Sources Spirit v1.5, Tops Global Embodied AI Benchmark - Pandaily

Benchmarks

OpenAI Evaluation Filter December 16, 2025 09:00

Evaluating AI’s ability to perform scientific research tasks

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.

Benchmarks Testing Tools OpenAI

OpenAI Evaluation Filter December 11, 2025 10:00

Advancing science and math with GPT-5.2

GPT-5.2 is OpenAI’s strongest model yet for math and science, setting new state-of-the-art results on benchmarks like GPQA Diamond and FrontierMath. This post shows how those gains translate into real research progress, including solving an open theoretical...

Benchmarks OpenAI