evald.ai page 24

Google News LLM Evaluation September 05, 2024 10:47

A review of model evaluation metrics for machine learning in genetics and genomics - Frontiers

METR Blog August 20, 2024 07:00

Vivaria

Vivaria is METR's tool for running evaluations and conducting agent elicitation research. Vivaria is a web application with which users can interact using a web UI and a command-line interface.

METR Blog August 07, 2024 17:00

Details about METR's preliminary evaluation of GPT-4o

We measured the performance of GPT-4o given a simple agent scaffolding on 77 tasks across 30 task families testing autonomous capabilities.

Testing Tools

METR Blog August 06, 2024 17:00

An update on our general capability evaluations

More tasks, human baselines, and preliminary results for GPT-4 and Claude.

Hugging Face Evaluation Filter July 25, 2024 00:00

LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs - Do We Still Need Fine-Tuning?

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

OpenAI Evaluation Filter July 10, 2024 06:30

OpenAI and Los Alamos National Laboratory announce research partnership

OpenAI and Los Alamos National Laboratory are working to develop safety evaluations to assess and measure biological capabilities and risks associated with frontier models.

Hugging Face Evaluation Filter July 01, 2024 00:00

Our Transformers Code Agent beats the GAIA benchmark 🏅

Benchmarks

Hugging Face Evaluation Filter June 24, 2024 00:00

Ethics and Society Newsletter #6: Building Better AI: The Importance of Data Quality

LLM Evaluation

OpenAI Evaluation Filter June 20, 2024 00:00

Improved Techniques for Training Consistency Models

Consistency models are a nascent family of generative models that can sample high quality data in one step without the need for adversarial training.

LLM Evaluation

Hugging Face Evaluation Filter June 06, 2024 00:00

Launching the Artificial Analysis Text to Image Leaderboard & Arena

Benchmarks

METR Blog June 02, 2024 18:00

Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”

Safety Evals

Hugging Face Evaluation Filter May 24, 2024 00:00

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Testing Tools