Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Topic feed
AI benchmarks, leaderboards, and comparative model testing.
We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s...
A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.