evald.ai page 10

MLCommons Evaluation Filter April 06, 2026 14:57

MLCommons Releases MLPerf Client v1.6 with Performance Optimizations and Enhanced User Experience - MLCommons

MLCommons releases MLPerf Client v1.6 with updated Windows ML and llama.cpp support, Apple MLX improvements for Mac and iPad, and usability enhancements for faster, more reliable AI benchmarking on personal computers.

Benchmarks

MLCommons Evaluation Filter April 01, 2026 14:50

MLCommons Releases New MLPerf Inference v6.0 Benchmark Results - MLCommons

MLCommons releases MLPerf Inference v6.0 results, the most significant benchmark update to date, with new tests for text-to-video, GPT-OSS 120B, DLRMv3, vision-language models, and YOLOv11.

Benchmarks

METR Blog April 01, 2026 07:00

Fine-tuning experiments on CoT controllability

We find that a small amount of fine-tuning on instruction following in the CoT generalizes to meaningful increases in CoT controllability on an out-of-distribution set of tasks. We fine-tune four reasoning models on small datasets of instruction-following...

Google News LLM Evaluation March 31, 2026 07:00

EPIC Joins Coalition Comment on NIST Guidance on AI Benchmark Evaluation - EPIC – Electronic Privacy Information Center

Benchmarks

Mitchell Bryson AI Reliability Articles March 30, 2026 00:00

OpenAI just showed everyone what an AI company looks like when the math stops working - Mitchell Bryson

In a single week, OpenAI killed Sora ($15M/day burn, $2.1M lifetime revenue), blindsided Disney on a $1B deal, shelved its adult chatbot, renamed its product org to 'AGI Deployment,' moved safety oversight away from the CEO, and bet everything on a model...

OpenAI

Google News LLM Evaluation March 29, 2026 07:00

AI benchmark helps robots plan and complete their chores in the real world - Tech Xplore

Benchmarks

Mitchell Bryson AI Reliability Articles March 29, 2026 00:00

The real AI war is over your memory - Mitchell Bryson

Google launched tools to import your ChatGPT memories and chat histories. Apple is turning Siri into a marketplace where every AI assistant plugs in — for a 30% cut. OpenAI is wiring Codex into every work tool you touch. Shopify made every AI conversation a...

LLM Evaluation OpenAI Google ChatGPT

Mitchell Bryson AI Reliability Articles March 28, 2026 00:00

The leak that repriced cybersecurity - Mitchell Bryson

Anthropic's accidental Mythos reveal crashed cybersecurity stocks — but the market was catching up to a reality that was already here. On the same day, CISA warned of active exploitation of AI agent frameworks, researchers disclosed basic vulnerabilities in...

Anthropic Mythos

Mitchell Bryson AI Reliability Articles March 27, 2026 00:00

The text box was just the prototype - Mitchell Bryson

In a single 48-hour stretch, Apple revealed plans to open Siri to rival chatbots, Mistral shipped an open-source voice model rivaling ElevenLabs, Google launched studio-quality AI music generation, and IBM embedded voice AI into enterprise agents. The chat...

LLM Evaluation Google

Google News LLM Evaluation March 26, 2026 07:00

Is AGI Here? Not Even Close, New AI Benchmark Suggests - Decrypt

Benchmarks

Google News LLM Evaluation March 26, 2026 07:00

The toughest AI benchmark just got a whole lot tougher - Sherwood News

Benchmarks

METR Blog March 26, 2026 07:00

Red-Teaming Anthropic's Internal Agent Monitoring Systems

A METR staff member spent three weeks red-teaming a subset of Anthropic's internal agent monitoring and security systems, discovering several novel vulnerabilities.

Anthropic