evald.ai page 19

Google News LLM Evaluation December 09, 2025 08:00

Seekr Introduces SeekrGuard for AI Model Evaluation - ExecutiveBiz

LLM Evaluation

METR Blog December 09, 2025 08:00

Common Elements of Frontier AI Safety Policies (December 2025 Update)

Shared components of AI lab commitments to evaluate and mitigate severe risks.

Safety Evals

Google News LLM Evaluation December 08, 2025 23:32

AI Benchmark for Materials Science Research - anl.gov

Benchmarks

Google News LLM Evaluation December 08, 2025 08:00

Top 5 Open-Source LLM Evaluation Platforms - KDnuggets

LLM Evaluation

Hacker News LLM Evaluation December 04, 2025 17:48

GitHub - mburaksayici/smallevals: smallevals — CPU-fast, GPU-blazing fast offline retrieval evaluation for RAG systems with tiny QA models.

Hacker News LLM Evaluation December 04, 2025 02:30

Evaluation Guidebook - a Hugging Face Space by OpenEvals

This page automatically loads score data from several LLM leaderboards and shows an interactive chart that tracks how top benchmark results have changed. The chart groups benchmarks by category, hi...

Benchmarks

OpenAI Evaluation Filter December 03, 2025 10:00

OpenAI to acquire Neptune

OpenAI is acquiring Neptune to deepen visibility into model behavior and strengthen the tools researchers use to track experiments and monitor training.

OpenAI

Google News LLM Evaluation December 01, 2025 08:00

Startup Minitap Tops DeepMind’s Mobile AI Benchmark, Raises $4.1 Million Seed Round - Forbes

Benchmarks

Hacker News LLM Evaluation November 24, 2025 14:19

LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Abstract page for arXiv paper 2511.06346: LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Benchmarks LLM Evaluation

Google News LLM Evaluation November 24, 2025 08:00

A new AI benchmark tests whether chatbots protect human well-being - TechCrunch

Benchmarks

Hugging Face Evaluation Filter November 21, 2025 00:00

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

Benchmarks

METR Blog November 19, 2025 08:00

Details about METR's evaluation of OpenAI GPT-5.1-Codex-Max

We evaluate whether GPT-5.1-Codex-Max poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs. We conclude that this seems unlikely.

OpenAI