METR Blog

Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more...

OpenAI Evaluation Filter

Scaling AI for everyone

Today we’re announcing $110B in new investment at a $730B pre-money valuation. This includes $30B from SoftBank, $30B from NVIDIA, and $50B from Amazon.
