MAX guide 13 min read June 19, 2026

How to Benchmark an LLM on MMLU-Pro, GPQA, and SWE-bench with lm-evaluation-harness in 2026

Routing three LLM benchmarks to the correct evaluation harness: MMLU-Pro, GPQA, and SWE-bench in 2026

Table of Contents

TL;DR

Route first: MMLU-Pro and GPQA run through lm-evaluation-harness; SWE-bench needs its own Docker harness. One command cannot do all three.
Pin everything: lm-eval v0.4.12 no longer bundles backends, GPQA’s dataset is gated, and scores drift unless you log fewshot count, prompt format, and answer extraction.
Verify before you trust: re-run for reproducibility, then sanity-check for contamination — SWE-bench Verified was retired in 2026 for exactly that reason.

You read the launch post. The new model tops MMLU-Pro, GPQA, and SWE-bench. You want to confirm the numbers yourself, so you pip install lm-eval, point it at all three tasks, and hit run. The harness chews through the first two and then errors out on the third. Different execution model. Different machine. The benchmark you most wanted to reproduce was never lm-eval’s job — and nobody told you before you burned an afternoon on it.

Before You Start

You’ll need:

An AI coding tool (Claude Code, Cursor, or Codex) to scaffold the run configuration
A GPU box or an inference API endpoint for the model under test
A Hugging Face account — one of these datasets is gated, more on that below
A working understanding of benchmark datasets and what a score actually measures

This guide teaches you: how to route each benchmark to the harness that can actually run it, and what to pin so your numbers reproduce next week.

The One-Command Benchmark Is a Myth

Three benchmarks, two completely different execution models. MMLU-Pro and GPQA are multiple-choice and generative — evaluation-harness tools like EleutherAI’s lm-evaluation-harness were built for exactly this shape of task. SWE-bench is agentic: the model has to produce a code patch that resolves a real GitHub issue, and then a separate Docker harness applies that patch and runs the repository’s own test suite. One asks a question and grades the answer. The other hands over a codebase and checks whether the build goes green.

It worked in the demo. In your pipeline, the SWE-bench task 404s because you assumed a swe_bench task existed inside lm-eval. It never did, and no version will add it. The fix is not a flag — it is understanding that you are running two harnesses, not one.

Step 1: Sort Your Benchmarks by Harness, Not by Name

Before you write a single command, decompose the job by execution model. The dataset name tells you what is being tested. The harness tells you how — and that is what determines whether your run even starts.

Your eval stack is three datasets and two harnesses:

MMLU-Pro — the mmlu_pro task in lm-eval. It spans 12,032 questions across 14 disciplines, and where the original MMLU offered four answer choices, MMLU-Pro offers ten, which collapses lucky guessing, per the MMLU-Pro paper. Multiple choice, fast, cheap.
GPQA — the gpqa_diamond_zeroshot family in lm-eval. The Diamond subset is 198 PhD-level questions in biology, physics, and chemistry, the hardest of three splits described in the GPQA paper. Generative, gated, slower.
SWE-bench — not an lm-eval task. It ships its own Docker harness, run via python -m swebench.harness.run_evaluation (or the sb-cli cloud runner, or Modal), confirmed in SWE-bench’s GitHub repository. Agentic, heavy, and the one that breaks naive pipelines.

The first two share a harness. The third does not. That single split is the whole reason the “one command” approach fails.

The Architect’s Rule: If you can’t name the harness before the dataset, you can’t reproduce the score.

One more sorting decision lives here. If you are writing a 2026 guide, SWE-bench Verified is legacy — treat its contamination-resistant successor, SWE-bench Pro, as the comparison target you actually report. Why, and the exact version details, come next.

Step 2: Lock Down the Eval Contract

Here is the part people skip, and it is the part that decides whether your number means anything. The eval contract is everything the harness needs pinned before it produces a score you can defend.

Context checklist:

lm-eval version pinned, and the correct backend extra installed — the install no longer assumes anything
Backend chosen on purpose: HuggingFace for correctness, vLLM or SGLang for throughput, the API backend for hosted models (lm-eval Docs list HuggingFace, vLLM, SGLang, OpenAI/LiteLLM, NeMo, and TensorRT-LLM)
--num_fewshot, prompt format, and answer-extraction settings recorded, because lm-eval results move when any of them change
Hugging Face authentication configured for the gated GPQA dataset
For SWE-bench: which split you are reporting (Full, Lite, or Pro) and which agent scaffold generates the patches

The Spec Test: If your run log doesn’t capture fewshot count, prompt format, and extraction settings, your score isn’t a measurement — it’s an anecdote.

Three breaking realities sit underneath that checklist. Consolidate them into one note and put it next to your run config, not in your memory.

Compatibility notes (2026):
lm-evaluation-harness v0.4.10+: Backends are no longer bundled. A plain pip install lm_eval will not run a model — you must install lm_eval[hf], lm_eval[vllm], or lm_eval[api]. Old snippets that reach for attributes like lm_eval.models.huggingface.HFLM now error, and the CLI was refactored into subcommands. Pin to v0.4.12, released 2026-05-11 (lm-eval’s releases page).
GPQA dataset is gated: Authenticate to the Hugging Face Hub before running gpqa_diamond_zeroshot, or the task fails to download with no useful score (lm-eval task README).
SWE-bench Verified is legacy: OpenAI stopped reporting it in February 2026 over contamination and now points to SWE-bench Pro (1,865 tasks, 41 repositories, multi-language). Report the successor, per OpenAI.

Step 3: Run in Dependency Order

Sequence matters because the cheap signal should fail fast and the expensive signal should run last. Build the substrate, then the light benchmarks, then the heavy one.

Build order:

Environment and backends first — install the right lm_eval[...] extra, authenticate to Hugging Face, and provision the SWE-bench host separately. That host needs real iron: at least 120 GB of free storage, 16 GB of RAM, and 8 CPU cores, per SWE-bench’s GitHub repository. Nothing runs until the substrate is correct.
MMLU-Pro and GPQA next, through lm-eval — they share one harness and one command shape (lm_eval --model hf --model_args pretrained=<model> --tasks mmlu_pro,gpqa_diamond_zeroshot --batch_size auto, per lm-eval Docs). They are fast and tell you early whether the model is in the right ballpark.
SWE-bench last — this is a two-stage job. First an agent scaffold generates patch predictions; then the swebench Docker harness grades them against each repository’s tests. The Full split is 2,294 instances across 12 Python repositories, with Verified at 500 and Lite at 300, per SWE-bench Docs. Lite or Pro is where most teams start.

For each benchmark, your context must specify:

What it receives — the model handle, the task name, the split
What it returns — accuracy for the multiple-choice tasks, resolved-issue rate for SWE-bench
What it must not do — no silent fallback to an unpinned dataset revision
How it fails — a gated download, an out-of-disk Docker run, a timeout on patch evaluation

Step 4: Prove the Numbers Are Real

A score you cannot defend is worse than no score, because it looks like evidence. Validation is two questions: does the number hold, and does it mean what you think it means.

Validation checklist:

Reproducibility — re-run each benchmark with the identical config. The scores should match. Failure looks like a different number every run, which means you left something unpinned in Step 2.
Contamination — ask whether the test questions already live in the model’s training data. Failure looks like an implausibly high score on a benchmark everyone has scraped. This is precisely where SWE-bench Verified fell: OpenAI’s audit traced roughly 32.7% solution leakage and found that 59.4% of audited failures came from flawed tests, per OpenAI. A high score there measures memorization, not capability. Contamination-resistant successors lean on fresher, sometimes synthetically generated test sets to stay ahead of benchmark contamination.
Config committed — store exact versions, fewshot count, prompt format, and extraction settings next to every score. Failure looks like a leaderboard row nobody can regenerate six months later.

Decision flow routing MMLU-Pro and GPQA to lm-evaluation-harness and SWE-bench to its own Docker harness, with pinning and validation steps — The benchmark-to-harness map: two multiple-choice tasks through lm-eval, one agentic task through its own Docker harness.

Common Pitfalls

What You Did	Why It Failed	The Fix
`pip install lm-eval` and ran a model	v0.4.10+ does not bundle backends	Install `lm_eval[hf]`, `lm_eval[vllm]`, or `lm_eval[api]`
Pointed lm-eval at SWE-bench	SWE-bench is not an lm-eval task	Run the `swebench` Docker harness instead
GPQA task failed to download	The dataset is gated on Hugging Face	Authenticate to the HF Hub before the run
Cited a SWE-bench Verified score	Deprecated in 2026 over contamination	Report SWE-bench Pro instead
Two runs, two different scores	Fewshot, prompt, and extraction unlogged	Pin and log the full eval contract

Pro Tip

Treat your eval configuration like a lockfile, not a memory. The version of lm-eval, the backend extra, the fewshot count, the prompt format, the dataset revision — all of it belongs in version control next to the result it produced. A benchmark score without its config is a screenshot of a number. A benchmark score with its config is something another engineer can rebuild from scratch, which is the only definition of a trustworthy result that survives contact with a skeptical reviewer.

Frequently Asked Questions

Q: How do you use benchmark datasets to evaluate a large language model?

A: Pick a dataset that matches the capability you care about, route it to the harness that runs it, then score the model’s outputs against the gold answers. The trap: multiple-choice accuracy and agentic resolve-rate are not interchangeable, so never average them into one number.

Q: How do you choose the right benchmark for evaluating an LLM in 2026?

A: Match the benchmark to the task. MMLU-Pro and GPQA test knowledge and reasoning under multiple choice; SWE-bench tests whether the model can ship a working patch. For coding agents in 2026, prefer SWE-bench Pro; Verified was retired over contamination.

Q: How do you run an LLM benchmark step by step with lm-evaluation-harness?

A: Install the right backend extra, authenticate to Hugging Face for gated tasks, then call lm_eval with your model and tasks. What old guides miss: since v0.4.10 a bare pip install lm-eval installs no backend, so the command has nothing to run the model with.

Your Spec Artifact

By the end of this guide, you should have:

A benchmark-to-harness map — which dataset runs in lm-eval, which runs in the SWE-bench Docker harness
A pinned eval contract — versions, backend extra, fewshot count, prompt format, dataset revision, and HF auth
A reproducibility checklist — re-run match, contamination sanity check, and the full config committed beside each score

Your Implementation Prompt

Hand this to your AI coding tool (Claude Code, Cursor, or Codex) to scaffold the run. It mirrors the four steps above — fill every bracket with your own values from the Step 2 contract.

You are setting up a reproducible LLM benchmark run for 2026. Follow this spec exactly.

CONTEXT
- Model under test: [model name + version, e.g. a HF checkpoint or an API handle]
- Compute: [GPU type and count, or inference API endpoint]
- Hugging Face token available: [yes/no — required for the gated GPQA dataset]

STEP 1 — ROUTE BY HARNESS
- MMLU-Pro and GPQA -> lm-evaluation-harness, tasks: mmlu_pro, gpqa_diamond_zeroshot
- SWE-bench -> its own Docker harness (swebench.harness.run_evaluation), NOT lm_eval
- SWE-bench split to report: [Pro | Lite | Full] (do not use the deprecated Verified split)

STEP 2 — PIN THE CONTRACT
- lm_eval version: [pin, e.g. 0.4.12]; backend extra: [hf | vllm | api]
- num_fewshot: [n]; prompt format: [default | custom]; answer extraction: [default | regex]
- Authenticate to the Hugging Face Hub before running any gated task

STEP 3 — BUILD ORDER
1. Provision env + backends; SWE-bench host needs [>=120 GB disk, >=16 GB RAM, >=8 CPU cores]
2. Run MMLU-Pro + GPQA through lm_eval
3. Generate SWE-bench patches with [agent scaffold], then grade with the swebench harness

STEP 4 — VALIDATE
- Re-run each benchmark with the identical config; the scores must match
- Commit the full config (versions, fewshot, prompt, extraction) next to every score
- Flag any score that looks too high as a possible contamination signal

Output: a runnable project layout and the exact commands per step. Do not collapse the two harnesses into one.

Ship It

You now have a mental model that separates the dataset from the harness, which is the difference between a number you reported and a number you can defend. Route by execution model, pin the contract, run in dependency order, and validate before you trust. Do that, and the next launch-post claim becomes something you can reproduce in an afternoon instead of argue about for a week.

Deploy safe, Max.

Sources

lm-eval’s releases page: Releases · EleutherAI/lm-evaluation-harness - Latest version and the v0.4.10+ breaking change to backend installs
lm-eval Docs: EleutherAI/lm-evaluation-harness - Install extras, supported backends, and run command shape
lm-eval task README: lm_eval/tasks/gpqa/README.md - GPQA task names and the gated-dataset requirement
MMLU-Pro paper: MMLU-Pro: A More Robust and Challenging Multi-Task Benchmark - Question count, disciplines, and ten-option format
GPQA paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark - Diamond/Main/Extended subset sizes and difficulty
SWE-bench Docs: SWE-bench Leaderboards and dataset sizes - Full, Verified, and Lite split sizes
SWE-bench’s GitHub repository: SWE-bench/SWE-bench - Docker evaluation harness and hardware requirements
OpenAI: Why we no longer evaluate SWE-bench Verified - Verified deprecation and contamination findings

Aha Moments

MONA

A benchmark score is a measurement, and every measurement carries error bars Max is right to insist on logging. MMLU-Pro and GPQA probe whether a model can select a correct answer from a fixed set — they sample a narrow slice of reasoning under tightly controlled conditions. SWE-bench asks something categorically different: can the model act, produce a patch, and survive a test suite. Collapsing both into a single leaderboard row hides that they measure different capabilities entirely. The number you publish is a point estimate, not a fixed property of the model. Re-run it and it shifts. That movement is not noise to be smoothed away — it is the honest width of what you actually know.

DAN

Mona calls the score a point estimate. The market treats it as truth, and that gap is exactly where buyers get burned. Every vendor now leads with benchmark numbers, which means the numbers stopped being a differentiator and started being the price of entry. The teams pulling ahead are the ones publishing reproducible configs alongside the score, because a result a buyer can rebuild earns trust faster than a bigger headline. Specification-driven evaluation is quietly becoming a procurement filter: if you cannot reproduce the claim, the claim is marketing. Expect serious buyers to start demanding the run config before they sign anything, and expect the unreproducible scores to be the first ones a competitor discredits.

ALAN

Dan frames reproducibility as a procurement filter. I would push further: it is also the only thing standing between us and benchmarks that quietly grade themselves. Contamination is the uncomfortable case — when the test questions have already leaked into training data, a high score measures memorization, not capability, and almost everyone in the chain has an incentive not to look too closely. OpenAI retiring a contaminated benchmark was the rare moment someone chose to look anyway. Most teams will not. The deeper risk is that we hang hiring, funding, and deployment decisions on a single reproducible-looking number while never auditing the dataset beneath it. So before you cite the next state-of-the-art result: do you actually know what is inside the benchmark, or only what the leaderboard told you?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors