evald.ai page 11

Google DeepMind Evaluation Filter March 25, 2026 16:46

Protecting People from Harmful Manipulation

Google DeepMind releases new findings and an evaluation framework to measure AI's potential for harmful manipulation in areas like finance and health, with the goal of enhancing AI safety.

Safety Evals Testing Tools Google Google DeepMind

OpenAI Evaluation Filter March 25, 2026 10:00

Inside our approach to the Model Spec

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

OpenAI

Google News LLM Evaluation March 25, 2026 07:00

Exclusive: This new benchmark could expose AI’s biggest weakness - Fast Company

Benchmarks

Mitchell Bryson AI Reliability Articles March 25, 2026 00:00

The verification tax - Mitchell Bryson

AI makes generating output almost free. But every AI output still needs checking — and checking doesn't scale with compute. The verification tax is the hidden cost most businesses ignore when deploying AI.

MLCommons Evaluation Filter March 24, 2026 14:47

A new GPT-OSS benchmark and DeepSeek R1 updates for latency-optimized reasoning - MLCommons

MLPerf Inference v6.0 introduces GPT-OSS 120B, a new open-weight LLM benchmark, plus a DeepSeek-R1 interactive scenario with support for speculative decoding.

Benchmarks

Google News LLM Evaluation March 24, 2026 07:00

Outlier Emphasizes Expert Contributor Network for AI Model Evaluation - TipRanks

LLM Evaluation

Mitchell Bryson AI Reliability Articles March 24, 2026 00:00

The land grab has gone financial - Mitchell Bryson

OpenAI's 17.5% guaranteed-return PE pitch, its 450,000 sq ft campus lease, and the Helion fusion deal all point to the same shift: the AI race is no longer about who has the best model — it's about who can lock in distribution, real estate, and energy...

Anthropic Benchmarks OpenAI

Google News LLM Evaluation March 20, 2026 07:00

Insilico Medicine Highlights AI Benchmark Results in Cardiovascular Drug Target Discovery - TipRanks

Benchmarks Target

METR Blog March 20, 2026 07:00

Impact of modelling assumptions on time horizon results

Alexander Barry examines how different modelling choices affect METR's time horizon estimates.

MLCommons Evaluation Filter March 19, 2026 18:59

Standardizing Generative AI Service Evaluation: An API-Centric Benchmarking Approach - MLCommons

MLPerf® Endpoints brings API-native benchmarking, Pareto curve visualizations, and rolling submissions to generative AI infrastructure evaluation.

Benchmarks

METR Blog March 19, 2026 07:00

We spent 2 hours working in the future

Thomas Kwa describes a tabletop exercise where METR researchers simulated having access to ~200-hour time horizon AIs.

Google News LLM Evaluation March 18, 2026 07:00

Arena Leaderboard: The Unbreakable Ranking System That’s Revolutionizing AI Model Evaluation - CryptoRank

Benchmarks LLM Evaluation