evald.ai page 27

METR Blog March 17, 2025 09:00

HCAST: Human-Calibrated Autonomy Software Tasks

Abstract page for arXiv paper 2503.17354: HCAST: Human-Calibrated Autonomy Software Tasks

METR Blog March 15, 2025 09:00

Response to OSTP on AI Action Plan

Suggested priorities for the Office of Science and Technology Policy as it develops an AI Action Plan.

METR Blog March 11, 2025 07:00

Why it’s good for AI reasoning to be legible and faithful

Why legible and faithful reasoning is valuable for safely developing powerful AI

METR Blog March 05, 2025 08:00

Details about METR's preliminary evaluation of DeepSeek-R1

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than...

Claude

METR Blog February 27, 2025 08:00

METR’s GPT-4.5 pre-deployment evaluations

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

OpenAI Evaluation Filter February 25, 2025 10:00

Deep research System Card

This report outlines the safety work carried out prior to releasing deep research, including external red teaming, frontier risk evaluations according to our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.

Safety Evals

OpenAI Evaluation Filter February 18, 2025 10:00

Introducing the SWE-Lancer benchmark

Can frontier LLMs earn $1 million from real-world freelance software engineering?

Benchmarks

Mitchell Bryson AI Reliability Articles February 17, 2025 00:00

EU/UK AI compliance in 2025: mapping the ICO risk toolkit to EU AI Act deadlines for product teams

What the EU AI Act requires in 2025–2027, how it lines up with the UK ICO's AI & Data Protection Risk Toolkit, and the exact outputs your team should ship.

Safety Evals

METR Blog February 14, 2025 08:00

Measuring Automated Kernel Engineering

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, we found that the best model can provide an average speedup on KernelBench of 1.8x.

Hugging Face Evaluation Filter February 14, 2025 00:00

Fixing Open LLM Leaderboard with Math-Verify

Benchmarks

Google News LLM Evaluation February 12, 2025 08:00

LLM-as-a-judge on Amazon Bedrock Model Evaluation | Amazon Web Services (AWS)

LLM Evaluation

METR Blog February 12, 2025 08:00

Details about METR's preliminary evaluation of DeepSeek-V3

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data...

Claude