Example autonomy evaluation protocol
An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.
Community feed
A focused stream of recent stories from the sources curated for this community.
Contribute to METR/public-tasks development by creating an account on GitHub.
Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.
A brief research report outlining quantitative research that could inform the "safety margin" to add to account for further post-training enhancements to agent capability.
METR has published a standard way to define tasks for evaluating the capabilities of AI agents.
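A task standard like this typically asks authors to package each task family as a small, uniform programmatic interface that an evaluation harness can enumerate, prompt, and score. As a rough illustration only, the sketch below shows what such an interface might look like; every name here (`TaskFamily`, `get_tasks`, `get_instructions`, `score`) is an assumption for this example, not METR's actual API.

```python
# Hypothetical sketch of a task-definition interface in the spirit of a
# task standard for agent evaluations. All identifiers are illustrative
# assumptions, not METR's published interface.

class TaskFamily:
    """A family of related tasks, each identified by a short name."""

    standard_version = "0.1"  # illustrative version string

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Map task names to their parameters.
        return {
            "reverse_small": {"text": "abc"},
            "reverse_large": {"text": "evaluation"},
        }

    @staticmethod
    def get_instructions(task: dict) -> str:
        # The prompt shown to the agent for one task.
        return f"Reverse the string {task['text']!r} and submit the result."

    @staticmethod
    def score(task: dict, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission == task["text"][::-1] else 0.0


# Example: an evaluation harness scoring one agent submission.
task = TaskFamily.get_tasks()["reverse_small"]
print(TaskFamily.score(task, "cba"))  # → 1.0
```

The value of a fixed interface like this is that any conforming harness can run any conforming task family without per-task glue code.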
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
A summary of what METR accomplished in 2023 – our first full year of operation.
We’re developing a blueprint for evaluating the risk that a large language model (LLM) could aid someone in creating a biological threat. In an evaluation involving both biology experts and students, we found that GPT-4 provides at most a mild uplift in...