Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
Topic feed
AI benchmarks, leaderboards, and comparative model testing.
We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s...
A factuality benchmark called SimpleQA that measures the ability of language models to answer short, fact-seeking questions.
We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.