Introducing EVMbench
OpenAI and Paradigm introduce EVMbench, a benchmark evaluating AI agents’ ability to detect, patch, and exploit high-severity smart contract vulnerabilities.
Community feed
A focused stream of recent stories from the sources curated for this community. Latest: Introducing EVMbench, How We Protect Confidential Information, and Analyzing coding agent transcripts to upper bound productivity gains from AI agents.
Our high-level approach to protecting access to confidential information
Amy Deng investigates whether coding agent transcripts could serve as an alternative method for estimating AI productivity uplift, analyzing 5,305 Claude Code transcripts from METR technical staff.
Nikola Jurkovic describes METR's measurements of AI time horizon using Claude Code and Codex scaffolds.
1Password open-sources a benchmark to stop AI agents from leaking credentials (Help Net Security)
Tether EVO Scores Top 5 in Global AI Benchmark for Brain-to-Text AI Challenge (Tether.io)
Thomas Kwa describes a simple model for forecasting when AI will automate AI development, based on the AI Futures model but with only 8 parameters.
University of Manchester academics contribute to the toughest AI benchmark (The University of Manchester)
Predicting to New Geographic Regions with Spatially Aware Model Evaluation (Esri)
Databricks adds MemAlign to MLflow to cut cost and latency of LLM evaluation (InfoWorld)
Joel Becker: Reconciling Impressive AI Benchmark Performance with Limited Developer Productivity Impacts (Stanford Digital Economy Lab)
NIST Seeks Public Input on Draft Best Practices for Automated AI Benchmark Testing (ExecutiveGov)