Benchmarks page 10 | evald.ai

OpenAI Evaluation Filter October 10, 2024 10:00

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.

Hugging Face Evaluation Filter October 04, 2024 00:00

Introducing the Open FinLLM Leaderboard

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Hugging Face Evaluation Filter July 01, 2024 00:00

Our Transformers Code Agent beats the GAIA benchmark 🏅

Hugging Face Evaluation Filter June 06, 2024 00:00

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Hugging Face Evaluation Filter May 14, 2024 00:00

Introducing the Open Arabic LLM Leaderboard

Hugging Face Evaluation Filter May 05, 2024 00:00

Introducing the Open Leaderboard for Hebrew LLMs!

Hugging Face Evaluation Filter May 03, 2024 00:00

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Hugging Face Evaluation Filter April 23, 2024 00:00

Introducing the Open Chain of Thought Leaderboard

Hugging Face Evaluation Filter April 19, 2024 00:00

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Hugging Face Evaluation Filter April 16, 2024 00:00

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Hugging Face Evaluation Filter February 23, 2024 00:00

Introducing the Red-Teaming Resistance Leaderboard

Hugging Face Evaluation Filter February 20, 2024 00:00

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Benchmarks LLM Evaluation
