Testing Tools

Hacker News LLM Evaluation December 19, 2025 20:25

Building an LLM evaluation framework: best practices | Datadog

Explore best practices for building an evaluation framework for production LLM applications.

Testing Tools LLM Evaluation

Hacker News LLM Evaluation December 16, 2025 13:28

GitHub - bassrehab/spark-llm-eval: Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration

Testing Tools LLM Evaluation

OpenAI Evaluation Filter December 16, 2025 09:00

Evaluating AI’s ability to perform scientific research tasks

OpenAI introduces FrontierScience, a benchmark testing AI reasoning in physics, chemistry, and biology to measure progress toward real scientific research.

Benchmarks Testing Tools OpenAI

OpenAI Evaluation Filter December 16, 2025 08:00

Measuring AI’s capability to accelerate biological research in the wet lab

OpenAI introduces a real-world evaluation framework to measure how AI can accelerate biological research in the wet lab. Using GPT-5 to optimize a molecular cloning protocol, the work explores both the promise and risks of AI-assisted experimentation.

Testing Tools OpenAI

Google News LLM Evaluation November 02, 2025 07:00

Polish emerges as top language in multilingual AI benchmark testing - PPC Land

Benchmarks Testing Tools

OpenAI Evaluation Filter September 05, 2025 10:00

Why language models hallucinate

OpenAI’s new research explains why language models hallucinate. The findings show how improved evaluations can enhance AI reliability, honesty, and safety.

Testing Tools OpenAI

Hacker News LLM Evaluation August 27, 2025 18:52

Stop “vibe testing” your LLMs. It's time for real evals. - Google Developers Blog

Explore Stax, an experimental developer tool that streamlines LLM evaluation with human labelling and scalable LLM-as-a-judge auto-raters for data-driven decisions.

Testing Tools LLM Evaluation Google

OpenAI Evaluation Filter August 27, 2025 10:00

Findings from a pilot Anthropic–OpenAI alignment evaluation exercise: OpenAI Safety Tests

OpenAI and Anthropic share findings from a first-of-its-kind joint safety evaluation, testing each other’s models for misalignment, instruction following, hallucinations, jailbreaking, and more—highlighting progress, challenges, and the value of cross-lab...

Anthropic Testing Tools OpenAI

Hacker News LLM Evaluation August 19, 2025 06:45

Viteval | Next generation LLM evaluation framework powered by Vitest.

Testing Tools LLM Evaluation

Google News LLM Evaluation August 01, 2025 07:00

MLPerf Client 1.0 AI benchmark released — new testing toolkit sports a GUI, covers more models and tasks, and supports more hardware acceleration paths - Tom's Hardware

Benchmarks Testing Tools

METR Blog January 17, 2025 08:00

AI models can be dangerous before public deployment

Why pre-deployment testing is not an adequate framework for AI risk management

Safety Evals Testing Tools

METR Blog August 07, 2024 17:00

Details about METR's preliminary evaluation of GPT-4o

We measured the performance of GPT-4o given a simple agent scaffolding on 77 tasks across 30 task families testing autonomous capabilities.

Testing Tools