evald.ai Topics

MLCommons Evaluation Filter

Bringing Text-to-Video to MLPerf Inference v6.0

MLCommons introduces the new Text-to-Video benchmark in MLPerf Inference v6.0, based on the Wan2.2-T2V-A14B-Diffusers model and validated using the VBench framework. Learn about the key architectural decisions, including the adoption of the SingleStream...

Benchmarks

METR Blog

Many SWE-bench-Passing PRs Would Not Be Merged into Main

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more...

Benchmarks