Build software better, together
GitHub is where people build software. More than 150 million people use GitHub to discover, fork, and contribute to over 420 million projects.
I have been reading a ton about LLM evaluation practices over the past few weeks, from Anthropic’s engineering blog, Hamel Husain’s practitioner-focused guides, the Evals for AI Engineers book by Shreya Shankar and Hamel Husain, and several eval framework...
In this article, I'll walk through everything you need to know about LLM evaluation metrics, with code samples.
Compare LLMs side by side with 3 lines of Python. Track evaluations across GPT, Claude, Llama, and any other model. Radar charts, scorecards, real-time streaming. Free forever.
Polyglot ontological activations for LLM systems. 68 terms from 20+ traditions mapped to computational patterns, plus 10 algorithms native to the ontology that have no equivalents in standard CS. Includes benchmark suite and a documented evaluation...
Evaluation Framework for LLM applications in Java and Kotlin - dokimos-dev/dokimos
Explore best practices for building an evaluation framework for production LLM applications.
Spark-native LLM evaluation framework with confidence intervals, significance testing, and Databricks integration - bassrehab/spark-llm-eval
smallevals — CPU-fast, GPU-blazing fast offline retrieval evaluation for RAG systems with tiny QA models. - mburaksayici/smallevals
This page automatically loads score data from several LLM leaderboards and shows an interactive chart that tracks how top benchmark results have changed. The chart groups benchmarks by category, hi...
Abstract page for arXiv paper 2511.06346: LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation
Multiple-Choice Benchmarks, Verifiers, Leaderboards, and LLM Judges with Code Examples