Benchmarks

Google News LLM Evaluation October 29, 2025 07:00

JetBrains launches AI benchmark platform DPAI Arena - Techzine Global

OpenAI Evaluation Filter October 27, 2025 10:00

Addendum to GPT-5 System Card: Sensitive conversations

This system card details GPT-5’s improvements in handling sensitive conversations, including new benchmarks for emotional reliance, mental health, and jailbreak resistance.

Benchmarks

Google News LLM Evaluation October 20, 2025 17:24

AI in Compliance: Insights from the EQS AI Benchmark Report - EQS Group

Benchmarks

Google News LLM Evaluation October 20, 2025 07:00

Bitdeer AI Benchmark: How It’s Revolutionizing Bitcoin Mining and AI Integration - OKX

Benchmarks

Google News LLM Evaluation October 10, 2025 07:00

InferenceMax AI benchmark tests software stacks, efficiency, and TCO — vendor-neutral suite runs nightly and tracks performance changes over time - Tom's Hardware

Benchmarks

Hacker News LLM Evaluation October 05, 2025 15:55

Understanding the 4 Main Approaches to LLM Evaluation (From Scratch)

Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples

Benchmarks LLM Evaluation

Google News LLM Evaluation September 17, 2025 07:00

MITRE and FAA Introduce Novel Aerospace Large Language Model Evaluation Benchmark - The MITRE Corporation

Benchmarks LLM Evaluation

Google News LLM Evaluation September 11, 2025 07:00

Intel Chips Excel in AI Benchmark: Will it Boost Prospects? - Zacks Investment Research

Benchmarks

METR Blog August 13, 2025 07:00

Research Update: Algorithmic vs. Holistic Evaluation

Many AI benchmarks use algorithmic scoring to evaluate how well AI systems perform on some set of tasks. However, AI systems often produce code that scores well but isn't production-ready due to issues with test coverage, formatting, and code quality. This...

Benchmarks LLM Evaluation

Google News LLM Evaluation August 06, 2025 07:00

Is your AI benchmark lying to you? - Nature

Benchmarks

Hugging Face Evaluation Filter August 01, 2025 14:25

📚 3LM: A Benchmark for Arabic LLMs in STEM and Code

A Blog post by Technology Innovation Institute on Hugging Face

Benchmarks

Google News LLM Evaluation August 01, 2025 07:00

MLPerf Client 1.0 AI benchmark released — new testing toolkit sports a GUI, covers more models and tasks, and supports more hardware acceleration paths - Tom's Hardware

Benchmarks Testing Tools