Model Evaluation

Measuring AI model and LLM output quality: classification metrics, benchmark suites and contamination, LLM-as-a-judge, human evaluation, and Elo/arena leaderboards.

47 articles, 467 min total read

What topics does this domain cover?

11 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Ablation Study →

An ablation study is a systematic method for understanding why an AI system works by selectively removing or disabling …

6 articles

Benchmark Contamination →

Benchmark contamination occurs when test data from evaluation benchmarks leaks into a model's training corpus, …

5 articles

Confusion Matrix →

A confusion matrix is a table that summarizes how well a classification model performs by breaking predictions into four …

6 articles

Elo Rating for LLMs →

Elo Rating for LLMs adapts the chess Elo rating system to evaluate language models through pairwise human preference …

0 articles

Evaluation Harness →

An evaluation harness is a standardized software framework that runs language models through curated suites of …

6 articles

Human Evaluation for AI →

Human evaluation for AI encompasses structured methodologies for trained human raters to assess model output quality …

0 articles

LLM-as-a-Judge →

LLM-as-a-Judge is a method where one large language model evaluates the output of another, scoring responses for …

6 articles

MMLU Benchmark →

MMLU (Massive Multitask Language Understanding) is a benchmark that evaluates large language models across dozens of …

5 articles

Model Evaluation →

Model evaluation is the process of measuring how well a large language model performs using benchmarks, human judgment, …

7 articles

Precision Recall and F1 Score →

Precision, recall, and F1 score are classification metrics used to evaluate machine learning models. Precision measures …

6 articles

SWE-bench →

SWE-bench is a benchmark that tests AI coding agents on real bugs and feature requests pulled from popular open-source …

0 articles

Four perspectives on this domain