evald.ai

Hugging Face Evaluation Filter February 10, 2025 00:00

The Open Arabic LLM Leaderboard 2

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

Benchmarks

METR Blog February 08, 2025 16:00

Frontier AI Safety Policies

Model Evaluation & Threat Research

Safety Evals LLM Evaluation

Hugging Face Evaluation Filter February 04, 2025 00:00

DABStep: Data Agent Benchmark for Multi-step Reasoning

Benchmarks

OpenAI Evaluation Filter January 31, 2025 11:00

OpenAI o3-mini System Card

This report outlines the safety work carried out for the OpenAI o3-mini model, including safety evaluations, external red teaming, and Preparedness Framework evaluations.

Safety Evals

METR Blog January 31, 2025 08:00

An update on our preliminary evaluations of Claude 3.5 Sonnet and o1

Preliminary evaluations of Claude 3.5 Sonnet (New) and o1, as well as some discussion of challenges in making capability-based safety arguments for AI models.

Google News LLM Evaluation January 28, 2025 08:00

Track LLM model evaluation using Amazon SageMaker managed MLflow and FMEval | Amazon Web Services

LLM Evaluation

OpenAI Evaluation Filter January 23, 2025 10:00

Drawing from OpenAI’s established safety frameworks, this document highlights our multi-layered approach, including model and product mitigations we’ve implemented to protect against prompt engineering and jailbreaks, protect privacy and security, as well...

METR Blog January 17, 2025 08:00

AI models can be dangerous before public deployment

Why pre-deployment testing is not an adequate framework for AI risk management

Safety Evals Testing Tools

Hugging Face Evaluation Filter January 09, 2025 00:00

CO₂ Emissions and Models Performance: Insights from the Open LLM Leaderboard

Benchmarks

Google News LLM Evaluation December 11, 2024 08:00

What Makes a Good AI Benchmark? - Stanford HAI

Benchmarks

OpenAI Evaluation Filter December 05, 2024 10:00

OpenAI o1 System Card

This report outlines the safety work carried out prior to releasing OpenAI o1 and o1-mini, including external red teaming and frontier risk evaluations according to our Preparedness Framework.

Safety Evals

Hugging Face Evaluation Filter December 04, 2024 00:00

Rethinking LLM Evaluation with 3C3H: AraGen Benchmark and Leaderboard

Benchmarks LLM Evaluation