METR Blog

Details about METR's preliminary evaluation of DeepSeek-R1

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than...

METR Blog

METR’s GPT-4.5 pre-deployment evaluations

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

OpenAI

Deep research System Card

This report outlines the safety work carried out prior to releasing deep research, including external red teaming, frontier risk evaluations under our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.


METR Blog

Measuring Automated Kernel Engineering

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, the best model achieves an average speedup of 1.8x on KernelBench.

METR Blog

Details about METR's preliminary evaluation of DeepSeek-V3

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data...

OpenAI

OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

