MAX guide 12 min read April 6, 2026

How to Benchmark LLMs with lm-evaluation-harness, HELM, and OpenCompass in 2026

Q: How to set up and run lm-evaluation-harness to benchmark a language model step by step?

Install with pip install lm-eval (requires Python 3.10+), then run a single command specifying model, task, and batch size. The step most tutorials skip: pin your few-shot count and random seed in the command arguments before comparing results. Default parameters shift between harness versions, so explicit configuration prevents silent score changes that invalidate your comparisons across updates.

TL;DR

Define your evaluation criteria before picking a harness — the tool comes after the spec
lm-evaluation-harness, HELM, and OpenCompass solve different problems — match the harness to your question
Benchmark scores without contamination checks and metric alignment are noise, not signal

You picked a model last quarter. The benchmark score looked strong: top-five on the leaderboard, solid reasoning numbers, clean accuracy. Three weeks into production, the model started hallucinating API schemas and formatting every response as a numbered list. The benchmark said one thing. Your users said another.

The benchmark wasn’t wrong. Your evaluation spec was missing.

Before You Start

You’ll need:

A clear question about your model (not “is it good?” — something specific)
Access to one of: lm-evaluation-harness, HELM, or OpenCompass
Basic familiarity with model evaluation concepts and precision, recall, and F1 score metrics
Python 3.10+ (required by lm-evaluation-harness and HELM; OpenCompass requires 3.8+ per OpenCompass PyPI)

This guide teaches you: how to decompose your evaluation question into a specification the harness can answer — and how to read the results without fooling yourself.

The Leaderboard Score That Meant Nothing

Here’s what happens every month. Team evaluates three candidate models. Team picks the one with the highest MMLU score. Team ships it. Three sprints later, the model fails on domain-specific reasoning — the exact task they needed it for.

The score was real. It just measured the wrong thing.

The Open LLM Leaderboard — which used lm-evaluation-harness as its backend — was retired in March 2025 (Hugging Face). Over 200 community leaderboards have since appeared on Hugging Face, each measuring different things. No single benchmark represents your production workload. If you don’t specify what “good” means for your use case, the harness will measure whatever its defaults say — and you’ll ship on someone else’s criteria.

Step 1: Decompose Your Evaluation Question

Before you touch a harness, answer three questions.

What capability am I testing? Reasoning? Instruction following? Code generation? Factual recall? Each maps to different benchmark tasks. A model that aces HellaSwag may still fail IFEval.

What does failure look like in my system? If your model generates API calls, failure is a malformed schema. If it writes summaries, failure is a missed key fact. Your failure mode determines your metrics — accuracy alone won’t catch format violations.

What’s my comparison baseline? Are you comparing two fine-tuned checkpoints? An open model against a commercial API? A model before and after quantization? Each comparison needs controlled conditions.

Your evaluation has these components:

Capability scope — the specific tasks your model must perform
Metric set — measurements that map to your failure modes (not just accuracy; think confusion-matrix metrics, format compliance, latency)
Comparison frame — what you’re measuring against and under what conditions
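
One way to make those three components concrete is to write them down as a small data structure before touching any harness. The sketch below is illustrative only: EvalSpec and its field names are hypothetical, not part of any harness, and the example values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical container for the three spec components."""

    capability_scope: list[str]  # tasks the model must perform
    metric_set: list[str]        # measurements tied to your failure modes
    comparison_frame: str        # what is compared, under what conditions

    def question(self) -> str:
        """Render the spec as a single-sentence evaluation question."""
        return (
            f"Does the model handle {', '.join(self.capability_scope)} "
            f"as measured by {', '.join(self.metric_set)} "
            f"when comparing {self.comparison_frame}?"
        )


# Example values are invented for illustration.
spec = EvalSpec(
    capability_scope=["API call generation"],
    metric_set=["schema validity rate", "exact-match accuracy"],
    comparison_frame="a checkpoint before and after quantization",
)
print(spec.question())
```

If question() comes out as a sentence you would not say out loud to a teammate, the spec probably needs another pass.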

The Architect’s Rule: If you can’t state your evaluation question in one sentence, the harness can’t answer it either.

Step 2: Match the Harness to the Question

Three harnesses. Three different strengths. Pick wrong and you’ll spend a week configuring something that doesn’t measure what you need.

lm-evaluation-harness (v0.4.11, EleutherAI PyPI) is the workhorse. Sixty-plus standard benchmarks, hundreds of subtasks. Supports HuggingFace, vLLM, SGLang, NeMo, OpenVINO, and API backends including OpenAI and TextSynth (EleutherAI GitHub). Best for: standardized academic benchmarks and direct model-to-model comparison on established tasks. Install with pip install lm-eval.

HELM (v0.5.14, Stanford CRFM PyPI) takes a wider lens. Where lm-evaluation-harness focuses on task accuracy, HELM evaluates across multiple dimensions (accuracy, bias, toxicity, efficiency) in a single run. It ships with domain-specific variants: MedHELM for medicine, VHELM for vision-language, HEIM for text-to-image (Stanford CRFM GitHub). Best for: multi-dimensional evaluation where you care about more than raw performance. Install with pip install crfm-helm.

OpenCompass (v0.5.2, OpenCompass GitHub) covers the most ground. Seventy-plus datasets, roughly 400,000 evaluation questions, and 20-plus pre-configured model profiles including Llama, Qwen, InternLM, and API models for OpenAI, Claude, and Gemini. The 2.0 release added CompassKit for evaluation toolkits, CompassHub for browsing benchmarks, and CompassRank for leaderboards. Best for: broad coverage evaluations and teams working with diverse model families. Install with pip install -U opencompass.

Note: OpenCompass deprecated its built-in multi-modality evaluation in April 2024 and moved that capability to VLMEvalKit. If you need vision-language benchmarks, use VHELM or VLMEvalKit instead.

Quick decision matrix:

Your Question	Best Harness	Why
“How does my model score on standard benchmarks?”	lm-evaluation-harness	Deepest coverage of academic tasks, fastest setup
“Is my model safe AND accurate?”	HELM	Multi-metric approach includes bias and toxicity
“How does my model compare across dozens of benchmarks at once?”	OpenCompass	Largest benchmark collection, strong multi-GPU support
“Does my RAG pipeline produce accurate answers?”	None of the three; use DeepEval	Application-level testing with LLM-as-judge metrics
“Is my model safe for deployment?”	Inspect AI	UK AISI framework with 100+ pre-built safety evals

Step 3: Configure Your Evaluation Pipeline

Right harness chosen. Now lock down the specification before you run anything.

Context checklist — specify ALL of these:

Model identifier — exact model path or API endpoint, not “the latest version”
Task set — which benchmarks, which subtasks, how many few-shot examples
Batch size and hardware — results vary with batch size; pin it for reproducibility
Output format — where results land, what format, what metadata to capture
Random seed — set it; reproducibility isn’t optional
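
For lm-evaluation-harness, the checklist above maps fairly directly onto command-line flags. The sketch below only builds and prints the command; it does not run it. The flag names are the ones I understand the 0.4.x CLI to accept, so confirm them with lm_eval --help on your installed version. The function name, model ID, task, and output path are illustrative assumptions.

```python
import shlex


def build_lm_eval_command(model_path, tasks, num_fewshot, batch_size,
                          seed, device, output_path):
    """Return an argv list for the lm_eval CLI with every parameter pinned."""
    return [
        "lm_eval",
        "--model", "hf",                              # Hugging Face backend
        "--model_args", f"pretrained={model_path}",   # exact model identifier
        "--tasks", ",".join(tasks),                   # explicit task set
        "--num_fewshot", str(num_fewshot),            # never rely on the default
        "--batch_size", str(batch_size),
        "--seed", str(seed),
        "--device", device,
        "--output_path", output_path,
        "--log_samples",                              # keep per-sample outputs
    ]


# Illustrative values only; substitute your own spec.
cmd = build_lm_eval_command(
    model_path="EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=5,
    batch_size=8,
    seed=42,
    device="cuda:0",
    output_path="results/pythia-160m",
)
print(shlex.join(cmd))
```

Building the command in code rather than typing it keeps the pinned values in version control next to the results they produced.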

Build order for each harness:

Install and verify — confirm the harness runs on a trivial task before configuring your full evaluation
Select task suite — pick benchmark tasks that map to the capabilities you identified in Step 1
Configure model backend — connect the harness to your model (local weights, vLLM endpoint, or API)
Run and capture — execute the evaluation with logging enabled

For HELM, the workflow is three separate commands: helm-run to execute, helm-summarize to aggregate, helm-server to view results in a web dashboard at localhost:8000 (Stanford CRFM GitHub). That separation is a design choice — it lets you re-summarize without re-running expensive evaluations.
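
As a rough sketch, the three-command HELM workflow could be assembled like this. The run entry string follows the pattern used in the HELM quick-start documentation; the suite name, the instance cap, and the helper function are assumptions for illustration. Nothing is executed here, the commands are only printed.

```python
import shlex


def build_helm_commands(run_entry, suite, max_eval_instances):
    """Return the run, summarize, and serve commands for one HELM suite."""
    return [
        # 1. Execute the evaluation (the expensive step).
        ["helm-run", "--run-entries", run_entry, "--suite", suite,
         "--max-eval-instances", str(max_eval_instances)],
        # 2. Aggregate results; can be repeated without re-running step 1.
        ["helm-summarize", "--suite", suite],
        # 3. Serve the web dashboard for browsing results.
        ["helm-server", "--suite", suite],
    ]


helm_commands = build_helm_commands(
    run_entry="mmlu:subject=philosophy,model=openai/gpt2",
    suite="my-suite",          # hypothetical suite name
    max_eval_instances=10,     # small cap for a smoke test
)
for command in helm_commands:
    print(shlex.join(command))
```

Keeping the instance cap small on the first pass is a cheap way to verify the pipeline end to end before committing to a full run.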

The Spec Test: If your evaluation config doesn’t specify the exact few-shot count, the harness will pick a default. That default may not match what the leaderboard used. Your “comparison” just became meaningless.
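
The Spec Test is easy to automate. A minimal sketch, assuming your config is a plain dictionary: the key names below are my own convention, not a harness schema, and the model ID is a placeholder.

```python
# Hypothetical list of parameters that must never fall back to a default.
REQUIRED_KEYS = ("model", "tasks", "num_fewshot", "batch_size", "seed", "output_path")


def unpinned_parameters(config: dict) -> list[str]:
    """Return the required keys that are missing or explicitly None."""
    return [key for key in REQUIRED_KEYS if config.get(key) is None]


config = {
    "model": "my-org/my-model",   # placeholder identifier
    "tasks": ["ifeval"],
    "batch_size": 8,
    "seed": 42,
}
print(unpinned_parameters(config))  # → ['num_fewshot', 'output_path']
```

An empty list means every parameter is explicit. Anything else is a value the harness will choose for you.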

Step 4: Read the Results Without Fooling Yourself

Numbers are in. Now prove they mean something.

Validation checklist:

Benchmark contamination check — did your training data include the benchmark questions? If yes, your accuracy score is memorization, not capability. Failure looks like: suspiciously high scores on well-known benchmarks paired with poor real-world performance.
Metric alignment — do the metrics you measured actually map to your production failure modes? Failure looks like: high accuracy but users still reporting bad outputs, because accuracy doesn’t capture format compliance or tone.
Reproducibility — run the same evaluation twice. Same results? If not, check batch size, random seed, and hardware configuration. Failure looks like: score differences between runs that exceed the actual difference between models.
Baseline sanity — does the score ordering match your qualitative experience? If your best model scores lowest, either your evaluation spec is wrong or your intuition needs updating. Investigate before shipping.
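
The reproducibility item in the checklist above can be turned into a simple guard: treat a gap between two models as signal only when it exceeds the run-to-run spread of either model. This is a deliberately crude heuristic, not a statistical test, and the scores below are invented.

```python
from statistics import mean


def run_spread(scores):
    """Largest difference between repeated runs of the same config."""
    return max(scores) - min(scores)


def difference_is_signal(scores_a, scores_b):
    """True if the gap between model means exceeds the worst run-to-run spread."""
    gap = abs(mean(scores_a) - mean(scores_b))
    noise = max(run_spread(scores_a), run_spread(scores_b))
    return gap > noise


# Invented accuracy scores from three repeated runs per model.
model_a = [0.712, 0.709, 0.714]
model_b = [0.711, 0.715, 0.708]
print(difference_is_signal(model_a, model_b))  # → False
```

With more runs per model you could swap the spread for a standard deviation or a bootstrap interval; the point is to measure the noise before trusting the gap.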

The one thing most teams skip: comparing across harnesses. lm-evaluation-harness and HELM may report different scores for the same model on the same benchmark — different prompt templates, different few-shot formats, different metric implementations. Run your top candidate through two harnesses. If the rankings hold, you have signal. If they flip, you have more work to do.
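
Checking whether rankings hold means comparing orderings across harnesses, not raw scores. A small sketch with made-up numbers; the model names and both functions are hypothetical.

```python
def ranking(scores: dict[str, float]) -> list[str]:
    """Model names ordered from highest score to lowest."""
    return sorted(scores, key=scores.get, reverse=True)


def rankings_hold(scores_a: dict[str, float], scores_b: dict[str, float]) -> bool:
    """True if both harnesses order the same models identically."""
    return ranking(scores_a) == ranking(scores_b)


# Invented scores for the same three models from two harnesses.
harness_one = {"model-x": 0.71, "model-y": 0.68, "model-z": 0.64}
harness_two = {"model-x": 0.66, "model-y": 0.61, "model-z": 0.63}

print(ranking(harness_one))  # → ['model-x', 'model-y', 'model-z']
print(ranking(harness_two))  # → ['model-x', 'model-z', 'model-y']
print(rankings_hold(harness_one, harness_two))  # → False
```

Here the top pick survives but second and third place flip, which is exactly the case that deserves a closer look at prompt templates and scoring.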

Figure: The spec-first evaluation pipeline. Decompose your question, match the harness, configure precisely, then validate the results actually matter.

Common Pitfalls

What You Did	Why the Evaluation Failed	The Fix
Picked the harness first	Measured available benchmarks, not your capabilities	Define evaluation question before selecting tools
Used leaderboard defaults	Few-shot count and prompt template didn’t match your comparison	Pin every parameter in your evaluation config
Skipped contamination check	Training data included benchmark questions	Run contamination detection or use held-out tasks
Compared raw scores across harnesses	Different harnesses use different prompt templates and scoring methods	Compare scores within the same harness and config; across harnesses, compare only rankings
Ignored multi-metric results	Optimized for accuracy, missed bias or format issues	Use HELM’s multi-dimensional reports for safety-critical models
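
For the contamination fix in the table above, word-level n-gram overlap is one common heuristic for spotting benchmark text inside training data. The sketch below is a simplified version under my own assumptions (whitespace tokenization, n of 8, made-up strings). It is not the detection method of any particular harness, and it will miss paraphrased contamination.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Set of lowercase word n-grams in the text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words; nothing to compare
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)


# Invented example strings.
question = "which of the following best describes the role of a load balancer in a cluster"
leaked = "study notes: " + question + " answer: b"
clean = "a load balancer spreads incoming requests over several servers"

print(overlap_ratio(question, leaked))  # → 1.0
print(overlap_ratio(question, clean))   # → 0.0
```

Any item with a high ratio is worth a manual look. The threshold for "high" is a judgment call you should record in your spec.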

Pro Tip

Your evaluation spec is a living document. Every time you fine-tune, swap the prompt template, or change the retrieval layer, re-run the evaluation with the same config. One benchmark run means nothing. A trendline across iterations tells you whether your system is improving or drifting. The power of a pinned spec is that it gives you a consistent lens — and consistency turns numbers into engineering data.
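
A trendline needs nothing more than an ordered history of scores from the same pinned config. A minimal sketch, with hypothetical iteration labels and invented scores:

```python
def score_deltas(history):
    """Change in score between consecutive iterations.

    history is a chronological list of (label, score) pairs, all produced
    with the same pinned evaluation config.
    """
    return [
        (label, round(score - previous_score, 4))
        for (_, previous_score), (label, score) in zip(history, history[1:])
    ]


# Invented scores across three iterations of the same system.
history = [("base", 0.70), ("ft-1", 0.73), ("ft-2", 0.71)]
print(score_deltas(history))  # → [('ft-1', 0.03), ('ft-2', -0.02)]
```

A negative delta after a change you expected to help is the drift signal the trendline exists to catch.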

Frequently Asked Questions

Q: How to choose between lm-evaluation-harness, HELM, and OpenCompass for model evaluation? A: Start with what you need to measure. lm-evaluation-harness wins on standard academic benchmarks and setup speed. HELM wins when you need safety, bias, and efficiency metrics alongside accuracy. OpenCompass wins on breadth — the largest dataset collection and pre-configured profiles for the widest model range. If you need application-level testing for RAG or agents, none of these three apply — look at DeepEval or Inspect AI instead.

Q: How to use an evaluation harness to compare open-source LLMs before deployment? A: Run every candidate through the same harness, same task set, same config — batch size and random seed included. The mistake that burns teams: comparing Model A’s lm-evaluation-harness score against Model B’s HELM score. Different harnesses use different prompt templates and scoring. Pick one harness, pin every parameter, then compare. Add a contamination check to confirm high scores reflect capability, not memorized benchmark answers.

Your Spec Artifact

By the end of this guide, you should have:

Evaluation question spec — one sentence stating what capability you’re measuring and what failure looks like
Harness selection rationale — which tool you chose and why it matches your question
Pinned evaluation config — model ID, task set, few-shot count, batch size, random seed, output path — every parameter explicit
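
One way to prove two runs used the same pinned config is to store a short fingerprint of the config next to every result. This is a sketch of that idea, not a feature of any harness; the key names and values are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, order-independent hash of an evaluation config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical pinned config; every parameter is explicit.
pinned = {
    "model": "my-org/my-model",
    "tasks": ["ifeval"],
    "num_fewshot": 0,
    "batch_size": 8,
    "seed": 42,
    "output_path": "results/my-model",
}
changed = {**pinned, "num_fewshot": 5}

print(config_fingerprint(pinned) == config_fingerprint(dict(reversed(pinned.items()))))
print(config_fingerprint(pinned) == config_fingerprint(changed))
```

The first comparison prints True because key order does not matter; the second prints False because a single changed parameter produces a different fingerprint. Scores with different fingerprints should not share a chart.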

Your Implementation Prompt

Use this prompt in Claude Code or Cursor to generate your evaluation pipeline configuration. Fill in the bracketed placeholders with the values from your Steps 1-4 work.

I need to set up an LLM evaluation pipeline. Here is my specification:

EVALUATION QUESTION:
- Capability being tested: [specific capability from Step 1 — e.g., instruction following, code generation, factual recall]
- Failure mode in my system: [what bad output looks like — e.g., malformed JSON, missed key facts, wrong API schemas]
- Comparison: [what I'm comparing — e.g., two fine-tuned checkpoints, open vs. commercial model, pre- vs. post-quantization]

HARNESS: [lm-evaluation-harness | HELM | OpenCompass]

CONFIGURATION:
- Model identifier: [exact model path or API endpoint]
- Task set: [specific benchmark tasks that map to my capability — e.g., IFEval for instruction following, HumanEval for code]
- Few-shot count: [exact number — e.g., 5-shot]
- Batch size: [pinned value — e.g., 8]
- Random seed: [fixed seed — e.g., 42]
- Output path: [where to store results]
- Hardware: [GPU type and count]

VALIDATION REQUIREMENTS:
- Run contamination check: [yes/no — specify method if yes]
- Multi-metric evaluation needed: [yes/no — if yes, list dimensions beyond accuracy: bias, toxicity, efficiency]
- Reproducibility: run [N] times and compare score variance

Generate the evaluation script, output parsing logic, and a comparison report template. Flag any conflicts between my configuration and the harness defaults.

Ship It

You now have a framework for turning “which model is better?” into a question with a verifiable answer. The harness is a tool. The spec is the strategy. Get the spec right, and the benchmarks stop being leaderboard theater and start being engineering data you can act on.

Aha Moments

MONA

The engineering instinct is right — you cannot improve what you cannot measure — but the measurement itself is more fragile than most teams realize. Evaluation harnesses apply scoring functions to model outputs, and those scoring functions embed assumptions about what correct looks like. When lm-evaluation-harness and HELM report different scores for the same model on the same benchmark, the divergence usually traces to prompt template construction and output parsing logic, not the model itself. The harness is not a neutral observer. It is an instrument with its own tolerances and biases. Treating any benchmark score as ground truth, rather than one measurement from one instrument under one set of conditions, is where evaluation efforts go sideways.

DAN

The retirement of the Open LLM Leaderboard reshaped how organizations approach model selection. Instead of one centralized ranking, teams now face a fragmented evaluation market — each framework measuring different dimensions under different conditions. That fragmentation is an advantage for teams that spec properly. When you define evaluation criteria before looking at any leaderboard, the proliferation of harnesses becomes a toolkit rather than a source of confusion. The teams pulling ahead are the ones treating evaluation as an internal engineering discipline, not an external stamp of approval they borrow from someone else’s benchmark suite.

ALAN

Both of you assume the evaluation question is well-formed. What concerns me is the upstream problem: who decides what “good” means, and whose failure modes get specified? A model optimized for standard academic benchmarks may score beautifully while failing on tasks specific to underrepresented domains or languages. The harness dutifully measures what you ask it to measure; it cannot flag what you forgot to ask. Before specifying which tool to run, the harder question is whether the evaluation criteria themselves contain blind spots that no configuration file and no validation checklist can reveal.

AI-assisted content, human-reviewed. Images AI-generated.