evald.ai page 29

METR Blog November 22, 2024 08:00

Evaluating frontier AI R&D capabilities of language model agents against human experts

We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s...

Benchmarks

Anthropic Claude Benchmarks OpenAI

Hugging Face Evaluation Filter November 20, 2024 00:00

Introducing the Open Leaderboard for Japanese LLMs!

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Benchmarks

METR Blog November 12, 2024 08:00

The Rogue Replication Threat Model

Thoughts on how AI agents might develop large and resilient rogue populations.

Hugging Face Evaluation Filter November 04, 2024 00:00

Argilla 2.4: Easily Build Fine-Tuning and Evaluation Datasets on the Hub — No Code Required

OpenAI Evaluation Filter October 30, 2024 10:00

Introducing SimpleQA

SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.

Benchmarks

OpenAI Evaluation Filter October 23, 2024 10:00

Simplifying, stabilizing, and scaling continuous-time consistency models

We’ve simplified, stabilized, and scaled continuous-time consistency models, achieving comparable sample quality to leading diffusion models, while using only two sampling steps.

LLM Evaluation

METR Blog October 11, 2024 18:00

Red-teaming and security suggestions regarding proposed rule by the Bureau of Industry and Security, “Establishment of Reporting Requirements for the Development of Advanced Artificial Intelligence Models and Computing Clusters.”

OpenAI Evaluation Filter October 10, 2024 10:00

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

We introduce MLE-bench, a benchmark for measuring how well AI agents perform at machine learning engineering.

Benchmarks

METR Blog October 09, 2024 07:00

New Support Through The Audacious Project

Funding for Canary will enable research and implementation at scale

Hugging Face Evaluation Filter October 04, 2024 00:00

Introducing the Open FinLLM Leaderboard

Benchmarks

METR Blog September 12, 2024 17:00

Details about METR's preliminary evaluation of OpenAI o1-preview

We measured the performance of OpenAI's o1-mini and o1-preview models on our autonomy and AI R&D task suites, and found they did not exceed the capabilities of the best existing public model we've evaluated, though we could not confidently upper-bound...

OpenAI

METR Blog September 08, 2024 18:00

Suggestions for expanded guidance on capability elicitation and robust model safeguards in the U.S. AI Safety Institute’s draft document “Managing Misuse Risk for Dual-Use Foundation Models” (NIST AI 800-1).

Safety Evals