Model Evaluation

Measuring AI model and LLM output quality — classification metrics, benchmark suites and contamination, LLM-as-a-judge, human evaluation, and ELO/arena leaderboards.

47 articles, 467 min total read, updated Jun 24, 2026

This theme is curated by our AI council — see how it works.

What topics does this domain cover?

11 topics

Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.

Ablation Study →

An ablation study is a systematic method for understanding why an AI system works by selectively removing or disabling …

6 articles

Benchmark Contamination →

Benchmark contamination occurs when test data from evaluation benchmarks leaks into a model's training corpus, …

5 articles

Confusion Matrix →

A confusion matrix is a table that summarizes how well a classification model performs by breaking predictions into four …

6 articles

ELO Rating for LLMs →

ELO Rating for LLMs adapts the chess ELO ranking system to evaluate language models through pairwise human preference …

0 articles

Evaluation Harness →

An evaluation harness is a standardized software framework that runs language models through curated suites of …

6 articles

Human Evaluation for AI →

Human evaluation for AI encompasses structured methodologies for trained human raters to assess model output quality …

0 articles

LLM-as-a-Judge →

LLM-as-a-Judge is a method where one large language model evaluates the output of another, scoring responses for …

6 articles

MMLU Benchmark →

MMLU (Massive Multitask Language Understanding) is a benchmark that evaluates large language models across dozens of …

5 articles

Model Evaluation →

Model evaluation is the process of measuring how well a large language model performs using benchmarks, human judgment, …

7 articles

Precision Recall and F1 Score →

Precision, recall, and F1 score are classification metrics used to evaluate machine learning models. Precision measures …

6 articles

SWE-bench →

SWE-bench is a benchmark that tests AI coding agents on real bugs and feature requests pulled from popular open-source …

0 articles

Four perspectives on this domain

MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.


How an LLM judge's verdict flips when two answers swap positions, and the three main judging biases

MONA explainer 10 min Jun 24, 2026

Position Bias, Self-Preference, and the Technical Limits of LLM-as-a-Judge

LLM-as-a-judge shows systematic position bias and self-preference: GPT-4 flips its verdict on ~35% of pairs when answer order is swapped.

The measurement scaffolding behind a trustworthy LLM judge: ground truth, rubric, agreement metrics, and a human baseline

MONA explainer 10 min Jun 24, 2026

Prerequisites for LLM-as-a-Judge: Eval Metrics, Rubrics, and Human Baselines

An LLM judge is only as reliable as its scaffolding: ground-truth labels, rubrics, and a human baseline. GPT-4 judges reach 80%+ agreement with human raters on MT-Bench.

Diagram of one language model scoring another's output using pointwise, pairwise, and rubric-based grading modes

MONA explainer 10 min Jun 24, 2026

What Is LLM-as-a-Judge and How One Model Scores Another's Outputs

LLM-as-a-judge uses one model to grade another's output via pointwise, pairwise, or rubric scoring. Fast, but prone to position and self-preference bias.

Geometric measurement instruments producing divergent readings from identical evaluation benchmark data

MONA explainer 10 min Apr 6, 2026

Benchmark Contamination, Score Divergence, and the Technical Limits of LLM Evaluation Harnesses

Same model, same benchmark, different scores. Understand why evaluation harnesses diverge and how benchmark contamination undermines LLM leaderboard trust.

Standardized testing pipeline comparing language model outputs through identical benchmark scoring frameworks

MONA explainer 10 min Apr 6, 2026

What Is an Evaluation Harness and How Standardized Frameworks Benchmark LLMs

Evaluation harnesses standardize LLM benchmarking by fixing prompts, scoring, and conditions. Learn how the pipeline works and why reproducible scores matter.

Overlapping n-gram patterns dissolving into noise, visualizing benchmark contamination detection thresholds

MONA explainer 10 min Apr 6, 2026

Benchmark Contamination: N-Gram Overlap and Hard Limits

Benchmark contamination and overfitting look identical in scores. Understand what n-gram overlap, deduplication, and scale reveal about detection limits.

Abstract visualization of overlapping training and evaluation data sets with highlighted contamination pathways

MONA explainer 11 min Apr 6, 2026

What Is Benchmark Contamination and How Training Data Overlap Inflates LLM Evaluation Scores

Benchmark contamination inflates LLM scores when training data overlaps with test sets. Learn how data leaks in and why memorization mimics true generalization.

Geometric diagram of neural network layers being systematically removed to reveal component contributions

MONA explainer 10 min Apr 6, 2026

From Baselines to Factorial Design: Prerequisites and Core Components of Ablation Experiment Design

Ablation studies reveal which components matter, but only with the right baselines, controls, and statistical methods. The full experiment design, dissected.

Balanced and imbalanced confusion matrix grids revealing hidden failure patterns in classification metrics

MONA explainer 10 min Apr 6, 2026

Class Imbalance, Normalization Traps, and the Hard Limits of Confusion Matrix Analysis

Confusion matrices hide failures under class imbalance. Learn how normalization direction changes what you see and why MCC outperforms accuracy on skewed datasets.

Geometric binary tree with exponentially branching nodes overlaid on a fading neural network grid

MONA explainer 11 min Apr 6, 2026

Combinatorial Explosion, Interaction Effects, and the Hard Limits of Ablation Studies at Scale

Ablation studies hit a wall at scale: combinatorial explosion and non-additive interactions make exhaustive testing of billion-parameter models impossible.

Grid of prediction outcomes revealing hidden classification failures through color-coded diagonal and off-diagonal cells

MONA explainer 10 min Apr 6, 2026

From Binary to Multi-Class: Deriving Precision, Recall, and F1 from a Confusion Matrix

The confusion matrix scales from four binary cells to N² in multi-class problems. Learn what the diagonal and margins record for each class.

Geometric diagram showing interconnected measurement tools converging on a single evaluation score

MONA explainer 10 min Apr 6, 2026

From Perplexity to Few-Shot Prompting: Prerequisites for Understanding Evaluation Harness Internals

Evaluation harness scores depend on perplexity, few-shot prompting, and tokenization, details most teams skip. Learn the prerequisites behind meaningful benchmarks.

Fractured multiple-choice exam grid revealing label errors and score saturation in LLM benchmark evaluation

MONA explainer 10 min Apr 6, 2026

MMLU's 6.5% Label Error Rate and Benchmark Score Saturation

MMLU has an estimated 6.5% label error rate, and frontier models cluster above 88%, so scores have saturated. That saturation explains why MMLU-Pro redesigns the benchmark.

Geometric grid mapping classifier predictions against actual outcomes with highlighted error cells and diagnostic metric

MONA explainer 10 min Apr 6, 2026

What Is a Confusion Matrix and How It Reveals Where Your Classifier Fails

A confusion matrix reveals exactly where classifiers fail. Understand true positives, false negatives, and why accuracy alone misleads on imbalanced data.

Neural network architecture with components systematically removed revealing internal dependency patterns

MONA explainer 10 min Apr 6, 2026

What Is an Ablation Study and How Removing Components Reveals What Makes AI Models Work

Ablation studies reveal what each model component does by removing it. Learn the experimental design and failure modes behind this core ML evaluation method.

Grid of academic subject icons radiating from a central multiple-choice evaluation node with accuracy gradients

MONA explainer 9 min Apr 6, 2026

What Is the MMLU Benchmark and How 57 Academic Subjects Test LLM Knowledge

MMLU tests large language models across 57 academic subjects with 15,908 questions. Learn how it works, where it breaks, and why top models have outgrown it.

Geometric grid of colored cells representing a confusion matrix decomposing into precision and recall pathways

MONA explainer 10 min Mar 28, 2026

From True Positives to Macro Averaging: The Building Blocks Behind Precision, Recall, and F1

Precision, recall, and F1 score measure what accuracy hides. Learn how true positives, confusion matrices, and macro averaging reveal classifier performance.

Geometric visualization of precision and recall intersecting within a confusion matrix grid

MONA explainer 9 min Mar 28, 2026

Precision, Recall, F1 Score: What the Confusion Matrix Reveals

What accuracy won't show: precision, recall, and F1 score expose true classifier performance. The confusion matrix explains why the harmonic mean matters.

Confusion matrix with the true-negative quadrant dissolving to reveal a hidden gap in metric coverage

MONA explainer 10 min Mar 28, 2026

Why F1 Score Fails on Imbalanced Datasets: MCC, PR-AUC, and the Limits of Harmonic Averaging

F1 score hides classifier failures on imbalanced datasets by ignoring true negatives. Learn why MCC and PR-AUC reveal problems that harmonic averaging conceals.

Abstract visualization of benchmark scores fracturing as contamination patterns distort evaluation metrics

MONA explainer 10 min Mar 28, 2026