evald.ai page 26

Google News LLM Evaluation May 20, 2025 07:00

Benchmarking LLMs: A guide to AI model evaluation - TechTarget

OpenAI Evaluation Filter May 12, 2025 10:30

Introducing HealthBench

HealthBench is a new evaluation benchmark for AI in healthcare which evaluates models in realistic scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.

Benchmarks

Mitchell Bryson AI Reliability Articles May 05, 2025 00:00

RAG data quality at scale: deduplication, semantic chunking, and hybrid retrieval that actually improves answers - Mitchell Bryson

A practical pipeline for high-quality Retrieval-Augmented Generation: remove duplicates, split semantically, fuse lexical + dense search, rerank, and measure.

LLM Evaluation

METR Blog April 16, 2025 07:00

Details about METR's preliminary evaluation of OpenAI's o3 and o4-mini

METR conducted a preliminary evaluation of OpenAI's o3 and o4-mini. The two models displayed higher autonomous capabilities than other public models tested, and o3 appears somewhat prone to "reward hacking".

OpenAI

OpenAI Evaluation Filter April 15, 2025 00:00

Our updated Preparedness Framework

Sharing our updated framework for measuring and protecting against severe harm from frontier AI capabilities.

Safety Evals

OpenAI Evaluation Filter April 10, 2025 10:00

BrowseComp: a benchmark for browsing agents

Benchmarks

Hugging Face Evaluation Filter April 08, 2025 00:00

Arabic Leaderboards: Introducing Arabic Instruction Following, Updating AraGen, and More

METR Blog April 04, 2025 07:00

Details about METR's preliminary evaluation of Claude 3.7

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench which provides the...

Claude

Google DeepMind Evaluation Filter April 02, 2025 13:31

Taking a responsible path to AGI

We’re exploring the frontiers of AGI, prioritizing technical safety, proactive risk assessment, and collaboration with the AI community.

Safety Evals

OpenAI Evaluation Filter April 02, 2025 10:15

PaperBench: Evaluating AI’s Ability to Replicate AI Research

We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research.

Benchmarks

OpenAI Evaluation Filter March 31, 2025 15:00

New funding to build towards AGI

Today we’re announcing new funding—$40B at a $300B post-money valuation, which enables us to push the frontiers of AI research even further, scale our compute infrastructure, and deliver increasingly powerful tools for the 500 million people who use ChatGPT...

ChatGPT

METR Blog March 19, 2025 07:00

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend...