MAX guide 13 min read June 7, 2026 Updated July 8, 2026

How to Build an Active Learning Loop with modAL, Cleanlab, and Prodigy in 2026

Active learning loop linking query strategy, label-error detection, and human annotation stages for efficient data labeling

TL;DR

Active learning pays off only when you label what the model is uncertain about — random sampling wastes your budget on examples the model already gets right.
The loop is three jobs and three tools: modAL picks the next batch, Cleanlab catches bad labels, Prodigy puts a human in the seat. Specify the contract between them before you wire anything.
modAL still works but is effectively unmaintained — pin your versions, test against your scikit-learn, and know that scikit-activeml is the active fallback if that’s a dealbreaker.

You had a budget for 50,000 labels. You spent it. The model got better — a little. Then you looked at what you actually paid to annotate, and half of it was examples the model was already calling correctly with 99% confidence. You paid human annotators to confirm what the model already knew. That’s the failure mode Active Learning exists to kill, and most teams reintroduce it the moment they skip the part where you specify which examples are worth a human’s time.

Before You Start

You’ll need:

A Pool Based Sampling setup — a large pool of unlabeled data plus a small seed set of labeled examples to train the first model
A Scikit Learn-compatible base model (the loop is built around that interface)
Three tools installed: modAL for query strategy, Cleanlab for label-quality checks, and Prodigy for annotation
A working understanding of Uncertainty Sampling and why model confidence is the signal that drives the whole loop

This guide teaches you: how to decompose an active learning loop into four stages with clean contracts between them, so each tool does exactly one job and you can debug the loop when — not if — it stalls.

The 50,000-Label Budget You Already Wasted

Here’s what goes wrong. You take your unlabeled pool, send a random sample to annotators, train, and repeat. It feels like progress because the labeled count goes up. But random sampling spends most of your money on the easy middle of the distribution — examples the model would have classified correctly anyway. The hard cases, the ambiguous ones, the rare classes that actually move accuracy — those show up at their natural (low) frequency, which is to say almost never.

It worked fine in the demo on Friday. On Monday someone pointed at the confusion matrix and asked why the minority class still had garbage recall after 50,000 labels. The answer: the loop never specified that uncertain examples should be prioritized, so it never saw enough of them.

Step 1: Decompose the Loop Into Four Stages

Before you install anything, draw the loop. An active learning system is not one program — it’s four stages passing data to each other, and most people fail because they treat it as a single script that “does active learning.”

Your system has these parts:

The pool and seed — the unlabeled pool you draw from and the small labeled seed that trains the first model. This is a separate concern because the Cold Start Problem lives here: too small a seed and your first model’s uncertainty estimates are noise.
The query engine — modAL. It takes the current model plus the pool and returns the next batch to label. Its only job is ranking. It does not train, label, or clean.
The label-quality gate — Cleanlab. It inspects labels coming back from humans (and the seed itself) for errors, outliers, and duplicates. It interfaces between annotation and training.
The annotation surface — Prodigy. A human sits here. It receives modAL’s batch and returns labels. Its contract is input (examples to label) and output (labels), nothing more.

The Architect’s Rule: If you can’t explain the loop in four stages with one input and one output each, the AI tool you’re prompting can’t build it either. Decomposition is the spec.

modAL is built on top of scikit-learn with modular query strategies, a detail confirmed in modAL Docs — which is exactly why it slots in as a pure ranking stage without owning your model.

Step 2: Specify the Query Strategy Contract

This is where most loops are won or lost. The Query Strategy is the rule that decides which examples are “worth it.” Specify it explicitly, because the default is rarely right for your data.

Context checklist — what the query engine must know before it ranks anything:

The base model’s prediction interface — modAL needs predict_proba or a decision function to estimate uncertainty
Which strategy — pure Uncertainty Sampling, Diversity Sampling, Query By Committee, or a hybrid
Batch size — how many examples per round, which trades off retraining cost against annotator idle time
Stopping criterion — when the loop ends (accuracy plateau, budget exhausted, uncertainty floor reached)
The cold-start handling — how the first batch is chosen before the model is trustworthy

The single most important decision here: uncertainty sampling finds hard examples; diversity sampling finds the full shape of your data. Uncertainty alone will happily hand you a hundred near-identical ambiguous examples from the same cluster. That’s redundant labeling, just dressed up as work.

The Spec Test: If your query strategy is pure uncertainty and your pool has tight clusters, the AI will build a loop that asks humans to label the same confusing example a dozen times. Add a diversity term to the contract, or your annotation budget drains into one corner of the distribution.

Step 3: Sequence the Build From Seed to Loop

Order matters because each stage depends on the output of the one before it. Build out of order and you’ll be debugging a loop whose first iteration was already poisoned.

Build order:

The seed and label-quality gate first — because your seed labels train the first model, and a model trained on dirty labels produces dirty uncertainty estimates. Run Cleanlab on the seed before anything else. Cleanlab detects label errors, outliers, and near-duplicates and is model-agnostic across PyTorch, sklearn, XGBoost, and HuggingFace, per Cleanlab Docs.
The query engine next — because it depends on a trained model’s probability outputs. Wire modAL only after you have a clean seed and a fitted model.
The annotation surface last — because Prodigy consumes modAL’s ranked batch. Prodigy is built by Explosion, the makers of spaCy, and ships cloud-free with built-in active-learning recipes, according to Prodigy — which means the annotation stage and the query stage can speak the same language.

For each stage, your context must specify:

What it receives (inputs)
What it returns (outputs)
What it must NOT do (the query engine must not mutate labels; the annotation surface must not retrain)
How to handle failure (an empty pool, a tie in uncertainty scores, an annotator skip)

Every round, the new labels flow back through Cleanlab before they reach training. That gate is what keeps Training Data Quality from degrading as the loop accumulates human error.

Security & compatibility notes:
modAL (maintenance): As of mid-2026, modAL’s last release was 0.4.2 (June 2024) — roughly two years stale, with dozens of open issues and pull requests per modAL’s GitHub repository. It still installs and runs, but pin your version and test against your current scikit-learn release rather than assuming compatibility.
modAL alternative: If an unmaintained dependency is a blocker for production, scikit-activeml is an actively developed library covering the same query strategies. Swap it in at the query-engine stage — the loop contract doesn’t change.
Cleanlab (version): Use cleanlab 2.9.0 (January 2026), which requires Python 3.10+. Drop any legacy Python 3.8/3.9 examples you find in older tutorials.

Step 4: Prove the Loop Actually Saves Labels

A loop that runs is not a loop that works. You need to prove it beats the baseline, or you’ve built an expensive random sampler.

Validation checklist:

Compare against random sampling — failure looks like: your active learning curve tracks the random-sampling curve. If labeling the “uncertain” examples doesn’t beat labeling random ones, your query strategy is broken or your model’s confidence is uncalibrated.
Check label quality every round — failure looks like: accuracy climbs, then stalls or drops. That’s human label error compounding. Run Cleanlab each round and watch the flagged-error count, not just the model score.
Watch for redundant queries — failure looks like: Cleanlab’s Data Deduplication flags many near-duplicates in your labeled set. Your query strategy is over-sampling one region; add diversity.
Confirm minority-class movement — failure looks like: overall accuracy rises but per-class recall on rare classes stays flat. The loop is optimizing the easy majority. Re-weight the query strategy toward the classes you actually care about.

Four-stage active learning loop showing seed and pool, modAL query engine, Cleanlab quality gate, and Prodigy annotation with data flowing in a cycle — The four-stage loop: each tool owns one job, and labels pass through the Cleanlab gate before every retraining round.

Common Pitfalls

What You Did	Why AI Failed	The Fix
One-shot “build me active learning”	Too many concerns in one script; the AI fused query, clean, and label into a tangle you can’t debug	Decompose into four stages with explicit contracts first
Pure uncertainty sampling	AI over-samples one ambiguous cluster; budget drains into redundant labels	Add a diversity term to the query-strategy spec
Skipped the seed clean	First model trains on dirty labels; every uncertainty estimate downstream is noise	Run Cleanlab on the seed before fitting the first model
Assumed modAL is current	AI generates code against an API that may not match your sklearn version	Pin versions; test against your stack, or swap in scikit-activeml
No baseline comparison	You can’t tell if the loop beats random; you ship an expensive coin flip	Plot the active learning curve against random sampling every run

Pro Tip

Treat calibration as a precondition, not an afterthought. Every query strategy that ranks by uncertainty assumes the model’s confidence scores mean something. A model that’s 99% confident and wrong half the time will send your annotators straight to the easy examples while skipping the hard ones. Before you trust the loop, check that predicted probabilities track actual accuracy. If they don’t, calibrate first — otherwise you’ve built a loop that confidently selects the wrong work, and no amount of clever sampling fixes a lying confidence score.

Frequently Asked Questions

Q: How to build an active learning loop with modAL step by step? A: Clean your seed with Cleanlab, fit a scikit-learn model, then let modAL rank the pool by uncertainty, label the top batch in Prodigy, retrain, repeat. The step most people skip: re-run Cleanlab on returned labels every round, not just on the seed — human error accumulates and quietly poisons later uncertainty estimates.

Q: How to use active learning to cut annotation costs on a limited labeling budget? A: Spend labels only where the model is uncertain instead of sampling randomly, and stop the loop the moment the accuracy curve plateaus rather than burning the full budget. Watch out for the cold start: with too small a seed, early uncertainty scores are noise, so your first few rounds may waste labels before the signal stabilizes.

Q: How to choose between uncertainty sampling and diversity sampling for your dataset? A: Use uncertainty sampling when your data is spread out and you want the hardest cases; use diversity sampling when your pool has tight clusters that uncertainty alone would over-sample. The practical answer is usually a hybrid — uncertainty to find hard examples, diversity to keep them from all coming from the same corner of the distribution.

Your Spec Artifact

By the end of this guide, you should have:

A four-stage loop map — seed/pool, query engine, quality gate, annotation surface — with one input and one output defined per stage
A query-strategy contract specifying strategy type, batch size, stopping criterion, and cold-start handling
A validation plan that compares your loop against random sampling and tracks label quality and minority-class recall every round

Your Implementation Prompt

Drop this into Claude Code, Cursor, or your AI tool of choice once you’ve filled the brackets with your own values. It mirrors the four-stage decomposition above so the tool builds the loop you specified, not the one it guessed at.

Build an active learning loop in Python with four separate, testable stages.
Do not fuse them into one script.

STAGE 1 — SEED & QUALITY GATE
- Base model: [your scikit-learn-compatible model]
- Seed set: [size and source]
- Run Cleanlab (2.9.0, Python 3.10+) on the seed to flag label errors,
  outliers, and near-duplicates BEFORE fitting the first model.

STAGE 2 — QUERY ENGINE
- Library: modAL [pin version 0.4.2; test against my scikit-learn X.X]
  OR scikit-activeml if modAL is incompatible.
- Strategy: [uncertainty | diversity | query-by-committee | hybrid]
- Batch size: [N per round]
- Stopping criterion: [accuracy plateau | budget exhausted | uncertainty floor]
- Constraint: ranking only. Must not train or mutate labels.

STAGE 3 — ANNOTATION SURFACE
- Tool: Prodigy with an active-learning recipe.
- Input: modAL's ranked batch. Output: labels. Must not retrain.
- Every returned batch passes back through Cleanlab before training.

STAGE 4 — VALIDATION
- Plot the active learning curve against a random-sampling baseline.
- Report per-class recall, not just overall accuracy.
- Flag redundant queries via Cleanlab deduplication each round.

Edge cases to handle: empty pool, tie in uncertainty scores, annotator skip,
cold start with an undersized seed.

Ship It

You now have a mental model that turns “do active learning” into four stages with contracts you can specify, test, and debug independently. You can look at a stalled loop and know whether the problem is a dirty seed, an over-eager query strategy, or a model whose confidence is lying to you. That’s the difference between a loop that saves labels and one that just spends them in a fancier order.

Sources

modAL’s GitHub repository: modAL-python/modAL — A modular active learning framework for Python - modAL maintenance status, release history, open issues and PRs
modAL Docs: modAL: A modular active learning framework for Python3 - modAL’s scikit-learn foundation and modular query strategies
cleanlab’s PyPI page: cleanlab · PyPI - Cleanlab version 2.9.0 and Python 3.10+ requirement
Cleanlab Docs: cleanlab open-source documentation - Label-error, outlier, and duplicate detection; model-agnostic support
Prodigy: Prodigy · An annotation tool for AI, Machine Learning & NLP - Prodigy’s vendor, cloud-free design, and built-in active-learning recipes
scikit-activeml: scikit-activeml: A Comprehensive and User-friendly Active Learning Library - Actively developed alternative to modAL for query strategies

Aha Moments

MONA

Max frames the loop as four stages, and the reason that works is mathematical. Uncertainty sampling is really a search for the examples sitting closest to the model’s decision boundary — the points where a small shift in the data flips the prediction. Those points carry the most information about where the boundary should actually be, which is why labeling them moves accuracy faster than labeling the confident interior. The catch Max names — calibration — is the hidden assumption underneath all of it. A model’s confidence only points at the boundary if those confidence scores are honest. When they drift, the whole geometry the loop depends on quietly bends, and the loop starts selecting the wrong work with total conviction.

DAN

Mona’s boundary-search framing explains why this is a competitive lever, not just a tidy engineering pattern. Most teams treat labeled data as a fixed cost and scale it linearly with ambition. The teams pulling ahead are the ones who turned labeling into a targeting problem — spending the same budget on a fraction of the examples and reaching the accuracy others pay many times over to match. Max’s insistence on a baseline comparison is the part that matters in a room with a CFO: if you can show the curve beating random sampling, you’ve turned a data expense into measurable efficiency. That’s the kind of result that decides whether the next labeling budget gets approved or cut.

ALAN

Dan’s targeting framing is sharp, but it points at something uncomfortable. An active learning loop decides, automatically, which examples a human ever sees — and by extension, which ones they never do. The rare cases, the edge populations, the examples the model is wrongly confident about: those are precisely the ones uncertainty sampling can route around if calibration is off. Max’s quality gate catches some of it, but the loop is still a machine that allocates human attention, and it optimizes for the metric you hand it. So before you ship the efficiency win Dan is celebrating, ask the harder question: if the loop chooses what your annotators look at, who is accountable for what it chooses to hide?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors