METR Blog

Details about METR's preliminary evaluation of DeepSeek-R1

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than...

METR Blog

METR’s GPT-4.5 pre-deployment evaluations

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

OpenAI

Deep research System Card

This report outlines the safety work carried out prior to releasing deep research, including external red teaming, frontier risk evaluations under our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.


METR Blog

Measuring Automated Kernel Engineering

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, the best model achieves an average speedup of 1.8x on KernelBench.

METR Blog

Details about METR's preliminary evaluation of DeepSeek-V3

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data...

OpenAI

OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

