Google DeepMind

Protecting People from Harmful Manipulation

Google DeepMind releases new findings and an evaluation framework to measure AI's potential for harmful manipulation in areas like finance and health, with the goal of enhancing AI safety.

OpenAI

Inside our approach to the Model Spec

Learn how OpenAI’s Model Spec serves as a public framework for model behavior, balancing safety, user freedom, and accountability as AI systems advance.

METR Blog

We spent 2 hours working in the future

Thomas Kwa describes a tabletop exercise in which METR researchers simulated having access to AIs with ~200-hour time horizons.
