Introducing EVMbench
OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
Community feed
A focused stream of recent stories from the sources curated for this community. Latest: Introducing EVMbench, How We Protect Confidential Information, and Analyzing coding agent transcripts to upper bound productivity gains from AI agents.
Our high-level approach to protecting access to confidential information
Amy Deng investigates whether coding agent transcripts could serve as an alternative method for estimating AI productivity uplift, analyzing 5,305 Claude Code transcripts from METR technical staff.
Nikola Jurkovic describes METR's measurements of AI time horizon using Claude Code and Codex scaffolds.
1Password open-sources a benchmark to stop AI agents from leaking credentials (Help Net Security)
Tether EVO Scores Top 5 in Global AI Benchmark for Brain-to-Text AI Challenge (Tether.io)
Thomas Kwa describes a simple model for forecasting when AI will automate AI development, based on the AI Futures model but with only 8 parameters.
University of Manchester academics contribute to the toughest AI benchmark (The University of Manchester)
Predicting to New Geographic Regions with Spatially Aware Model Evaluation (Esri)
Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation (InfoWorld)
Joel Becker: Reconciling Impressive AI Benchmark Performance with Limited Developer Productivity Impacts (Stanford Digital Economy Lab)
NIST Seeks Public Input on Draft Best Practices for Automated AI Benchmark Testing (ExecutiveGov)