MAX guide 12 min read May 31, 2026 Updated July 8, 2026

How to Build a Training Data Quality Pipeline with Cleanlab, Snorkel, and Lightly in 2026

Diagram of a training data quality pipeline: curate samples, label with weak supervision, then audit labels for errors

TL;DR

A data quality pipeline is three separate jobs — curate what goes in, label it, and audit the labels. Build them as distinct stages, not one script.
Define what “a label error” means for your project before you run any tool. The tools detect what you specify, not what you hope.
Treat the audit as a measurement, not a vibe. You should be able to show that label quality went up after a pass, with numbers you can point at.

You fine-tune a classifier. Accuracy looks fine in your notebook. You ship it. Two weeks later support tickets pile up because the model confidently labels the wrong thing — and when you trace it back, the same mistake is sitting in your training set, repeated hundreds of times. Nobody audited the labels. The model didn’t fail. It learned exactly what you gave it.

That’s the pattern I see most: the data is the bug, and there’s no stack trace for data.

Before You Start

You’ll need:

An AI coding tool you already use (Claude Code, Cursor, or Codex) to scaffold the pipeline stages
A working understanding of Training Data Quality and where Label Noise comes from
A clear picture of one dataset you actually want to clean — not a hypothetical one

This guide teaches you: how to decompose data quality into three independent stages — curation, labeling, and auditing — so each tool does one job well instead of one script trying to do everything.

The Model That Learned Your Mistakes

Most “the model is dumb” reports are really “the labels were wrong” reports. A handful of mislabeled examples in a small training set bends the decision boundary, and the model dutifully reproduces your error at inference time. This is the whole premise of Data-Centric AI: you get more out of fixing the data than out of swapping architectures.

It passed eval on Friday. On Monday the same model started misclassifying a whole category — because the new batch of labels came from a different annotator who interpreted the guideline differently, and nobody checked the labels against the old ones before training.

Step 1: Separate the Three Jobs Your Pipeline Actually Does

People reach for “a data cleaning script” and end up with one tangled file that selects samples, generates labels, and tries to catch errors all at once. When it breaks, you can’t tell which job broke. Decompose first.

Your pipeline has these parts:

Curation — decides which raw samples are worth keeping and labeling. This is where you control diversity, dedup, and Data Provenance. Owned by Lightly.
Labeling — produces labels at scale when you don’t have enough hand-labeled data. This is programmatic Weak Supervision. Owned by Snorkel.
Auditing — finds the labels that are wrong, the samples that are outliers, and the near-duplicates that inflate your metrics. Owned by Cleanlab.

Each stage has a different input and a different output. Curation takes raw data, returns a selected subset. Labeling takes the subset, returns labels. Auditing takes labeled data, returns a ranked list of suspect examples. Three contracts. Three failure modes you can isolate.
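
The three contracts can be written down as types before any tool is wired in. This is a minimal sketch, not an API from Lightly, Snorkel, or Cleanlab: every name here (Sample, LabeledSample, Suspect, Curator, Labeler, Auditor, run_pipeline) is hypothetical and exists only to make the input and output of each stage explicit.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Sample:
    sample_id: str
    payload: str  # hypothetical field: raw text, a file path, etc.


@dataclass(frozen=True)
class LabeledSample:
    sample: Sample
    label: str | None  # None means the labeling stage abstained


@dataclass(frozen=True)
class Suspect:
    sample_id: str
    score: float  # higher means more likely to be a problem
    reason: str


class Curator(Protocol):
    def curate(self, pool: Sequence[Sample]) -> list[Sample]: ...


class Labeler(Protocol):
    def label(self, subset: Sequence[Sample]) -> list[LabeledSample]: ...


class Auditor(Protocol):
    def audit(self, labeled: Sequence[LabeledSample]) -> list[Suspect]: ...


def run_pipeline(pool, curator: Curator, labeler: Labeler, auditor: Auditor):
    """Run the stages in order and return every intermediate result."""
    subset = curator.curate(pool)        # raw pool -> selected subset
    labeled = labeler.label(subset)      # subset -> labels
    suspects = auditor.audit(labeled)    # labeled set -> suspect list
    ranked = sorted(suspects, key=lambda s: s.score, reverse=True)
    return subset, labeled, ranked
```

Because run_pipeline returns all three intermediate results, you can inspect the output of each stage on its own when something looks off.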

The Architect’s Rule: If you can’t explain the pipeline in three layers, the AI can’t build it, and you can’t debug it.

Step 2: Define What “Clean” Means Before You Touch the Data

Here’s the step everyone skips. You can’t detect a label error until you’ve written down what a correct label is. The tools are precise. Your definition has to be too.

Context checklist — specify these before the first run:

Label schema and edge cases — the exact classes, and the rule for ambiguous samples that sit between two of them
Duplicate policy — what counts as a duplicate. Exact match? Near-match above a similarity threshold? This drives your Data Deduplication rule
Outlier tolerance — how far from the distribution a sample can sit before it’s flagged for review instead of trained on
Class balance target — whether you’re correcting for Class Imbalance or preserving the real-world ratio
Stack and versions — pin them, because these tools have real version constraints (see the note below)
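
One way to keep the checklist honest is to encode it as a small object that refuses to pass while a field is still undefined. This is a sketch under assumptions: CleanSpec, its field names, and the two balance policies are hypothetical, not part of any of the three tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleanSpec:
    """Hypothetical container for the 'what clean means' decisions."""
    classes: tuple[str, ...]
    ambiguous_rule: str          # what to do with samples between two classes
    duplicate_threshold: float   # similarity in (0, 1]; 1.0 means exact match only
    outlier_tolerance: float     # distance (in your chosen units) before review
    balance_policy: str          # "rebalance" or "preserve"

    def problems(self) -> list[str]:
        """Return a list of spec gaps; an empty list means the spec is usable."""
        issues = []
        if len(set(self.classes)) < 2:
            issues.append("label schema needs at least two distinct classes")
        if not self.ambiguous_rule.strip():
            issues.append("no rule for ambiguous samples")
        if not 0.0 < self.duplicate_threshold <= 1.0:
            issues.append("duplicate threshold must be in (0, 1]")
        if self.outlier_tolerance <= 0:
            issues.append("outlier tolerance must be positive")
        if self.balance_policy not in {"rebalance", "preserve"}:
            issues.append("balance policy must be 'rebalance' or 'preserve'")
        return issues
```

Run problems() before the first pipeline pass and treat a non-empty result as a blocker.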

The Spec Test: If your duplicate policy doesn’t define a similarity threshold, the AI will pick one for you — usually exact-match — and your near-duplicate leakage between train and test sails straight through, inflating accuracy you don’t actually have.
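
To see why the threshold matters, here is a small text-only illustration of train/test leakage. It is a stand-in, not Lightly's or Cleanlab's method: it uses the standard library's difflib for string similarity, whereas an image pipeline would compare embeddings. The function names and sample sentences are made up for the example.

```python
from difflib import SequenceMatcher

TRAIN = [
    "The quick brown fox jumps over the lazy dog",
    "Order arrived damaged, please refund",
]
TEST = [
    "The quick brown fox jumps over the lazy dog!",
    "Where is my invoice?",
]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b), autojunk=False).ratio()


def find_leaks(train, test, threshold):
    """Return (train_index, test_index, score) for pairs at or above threshold."""
    leaks = []
    for i, tr in enumerate(train):
        for j, te in enumerate(test):
            score = similarity(tr, te)
            if score >= threshold:
                leaks.append((i, j, round(score, 3)))
    return leaks


exact_only = find_leaks(TRAIN, TEST, threshold=1.0)   # misses the near-duplicate
near_match = find_leaks(TRAIN, TEST, threshold=0.9)   # catches it
```

With an exact-match threshold the pair that differs by one punctuation mark goes undetected. At 0.9 it is flagged. The pairwise loop is quadratic, so treat it as a demonstration rather than something to run on a large pool.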

Version & compatibility notes:
Cleanlab: Use 2.9.0 (released January 13, 2026), which requires Python 3.10 or newer (cleanlab PyPI).
Snorkel (open-source): Latest is 0.10.0 (February 2024), which requires Python 3.11 or newer. The open-source package is in low-maintenance mode; the team’s active development moved to the commercial Snorkel Flow platform (snorkel PyPI). Expect slow updates, pin your version, and don’t wait on new OSS features.
Lightly: Use 1.5.24 (released May 2026). Pin your project to Python 3.12 or lower — Python 3.13 is not yet supported (lightly PyPI). Note that LightlyOne is being superseded by LightlyStudio for new curation work.
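
Taken at face value, these three notes leave only Python 3.11 and 3.12 if you want all three tools in one environment. The sketch below just restates the notes as data and checks an interpreter version against them. The bounds are copied from the notes above, not re-verified against PyPI, and blockers is a hypothetical helper.

```python
import sys

# (minimum, maximum) supported (major, minor), as stated in the notes above.
# None means the notes give no bound on that side.
CONSTRAINTS = {
    "cleanlab 2.9.0": ((3, 10), None),
    "snorkel 0.10.0": ((3, 11), None),
    "lightly 1.5.24": (None, (3, 12)),
}


def blockers(version):
    """Return the tools whose stated Python range excludes this version."""
    blocked = []
    for tool, (low, high) in CONSTRAINTS.items():
        if low is not None and version < low:
            blocked.append(tool)
        elif high is not None and version > high:
            blocked.append(tool)
    return blocked


if __name__ == "__main__":
    current = sys.version_info[:2]
    print(current, blockers(current) or "ok")
```

If you run the stages in separate environments instead, each stage only has to satisfy its own row.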

Step 3: Sequence the Pipeline — Curate, Label, Audit

Order matters, because each stage feeds the next. Get it wrong and you’ll spend labeling budget on samples you should have thrown away.

Build order:

Curate first, with Lightly — because labeling and auditing both cost time you don’t want to spend on redundant or low-value samples. Lightly’s approach uses self-supervised embeddings plus active-learning selection to pick a diverse, informative subset from a large pool, which is its core curation method (Lightly Blog). For vision datasets especially, this is where you kill redundancy before it costs you anything downstream.
Label next, with Snorkel — because once you’ve chosen what to keep, you need labels for it. Snorkel’s mechanism is labeling functions: small programmatic heuristics that each vote a label or abstain, combined into probabilistic labels by its LabelModel (snorkel PyPI). Use the maintained LabelModel API — older MeTaL and DryBell names are research artifacts, not current packages.
Audit last, with Cleanlab, because auditing only makes sense once labels exist. Cleanlab uses Confident Learning to find label errors, and its Datalab class is the single entry point that also checks for outliers, duplicates, and class-imbalance issues across text, image, audio, and tabular data (Cleanlab Docs). It’s model-agnostic: label-error detection works from the predicted probabilities of whatever model you already have, while the outlier and near-duplicate checks are typically run on feature embeddings.
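
The labeling-function idea from the second step is easy to see in miniature. The sketch below is not Snorkel's API: it swaps the LabelModel (which learns how much to trust each function) for a plain majority vote, and the three functions and class names are invented for illustration.

```python
from __future__ import annotations

from collections import Counter

ABSTAIN = None  # a labeling function returns this when it has no opinion


def lf_refund(text: str) -> str | None:
    return "billing" if "refund" in text.lower() else ABSTAIN


def lf_invoice(text: str) -> str | None:
    return "billing" if "invoice" in text.lower() else ABSTAIN


def lf_crash(text: str) -> str | None:
    return "bug" if "crash" in text.lower() else ABSTAIN


LFS = [lf_refund, lf_invoice, lf_crash]


def vote(text: str, lfs=LFS):
    """Majority vote over non-abstaining functions.

    Returns (label, vote_shares). The label is ABSTAIN when nothing
    fired or when the top two labels tie, so the caller can route the
    sample to review instead of guessing.
    """
    votes = [v for lf in lfs if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN, {}
    counts = Counter(votes)
    shares = {label: n / len(votes) for label, n in counts.items()}
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN, shares
    return ranked[0][0], shares
```

A call like vote("App crash after refund") produces a tie between the two classes, which is exactly the kind of conflict your failure-handling rule needs to cover.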

For each stage, your context must specify:

What it receives (raw pool / selected subset / labeled set)
What it returns (subset / labels / ranked suspect list)
What it must NOT do (curation must not relabel; auditing must not silently drop samples)
How to handle failure (what happens to abstained labels, to flagged outliers, to ties)

The output of stage three usually loops back: the suspect list from Cleanlab tells you which labels to fix or which labeling functions to rewrite. A data quality pipeline is a loop, not a one-shot.
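
To make the suspect list concrete, here is a simplified version of the per-class threshold rule behind confident learning. It is a teaching sketch, not Cleanlab's implementation, which does considerably more. The function names and the toy probabilities are made up, and in practice the probabilities should be out-of-sample (for example, from cross-validation).

```python
def class_thresholds(labels, pred_probs, n_classes):
    """Threshold for class k = mean predicted P(k) over examples labeled k."""
    thresholds = []
    for k in range(n_classes):
        ps = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        # A class with no examples can never be confidently predicted.
        thresholds.append(sum(ps) / len(ps) if ps else 1.0)
    return thresholds


def rank_suspects(labels, pred_probs):
    """Return (index, given, suggested, self_confidence), most suspect first."""
    n_classes = len(pred_probs[0])
    t = class_thresholds(labels, pred_probs, n_classes)
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        confident = [k for k in range(n_classes) if p[k] >= t[k]]
        if not confident:
            continue  # the model is not confident about any class
        suggested = max(confident, key=lambda k: p[k])
        if suggested != y:
            suspects.append((i, y, suggested, p[y]))
    # Lowest probability on the given label first.
    suspects.sort(key=lambda s: s[3])
    return suspects


# Toy data: examples 2 and 5 carry labels the model strongly disagrees with.
LABELS = [0, 0, 0, 1, 1, 1]
PRED_PROBS = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
    [0.10, 0.90],
    [0.85, 0.15],
]

suspect_list = rank_suspects(LABELS, PRED_PROBS)
```

The output is a ranked list to review, not a list to delete, and each entry records the suggested class so you can tell whether a labeling function or a guideline is the real culprit.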

Step 4: Prove the Data Got Better

“It looks cleaner” is not a result. You specified what clean means in Step 2 — now measure against it.

Validation checklist:

Label-error count dropped — failure looks like: you fixed flagged labels but the re-run surfaces the same count, meaning your fixes introduced new errors
Train/test duplicate leakage is zero — failure looks like: suspiciously high eval accuracy that collapses on a truly held-out set
Outlier set was reviewed, not deleted — failure looks like: your rare-but-valid edge cases vanished, and the model now fails on exactly those in production
Label agreement improved — failure looks like: the LabelModel’s confidence stayed flat, meaning your labeling functions conflict more than they agree
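
The checklist becomes a measurement once each pass emits the same small report and you compare two of them. This is a sketch with hypothetical names (AuditReport, failed_checks) and a hypothetical label_agreement metric. How you compute each number depends on your tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditReport:
    """Hypothetical per-pass summary; fill it from your own tooling."""
    label_errors: int        # suspect labels flagged by the audit
    train_test_leaks: int    # duplicate pairs spanning train and test
    outliers_flagged: int
    outliers_reviewed: int   # flagged outliers a human actually looked at
    label_agreement: float   # higher means labeling sources agree more


def failed_checks(before: AuditReport, after: AuditReport) -> list[str]:
    """Compare two passes and name every check that did not pass."""
    failures = []
    if after.label_errors >= before.label_errors:
        failures.append("label-error count did not drop")
    if after.train_test_leaks != 0:
        failures.append("train/test duplicate leakage is not zero")
    if after.outliers_reviewed < after.outliers_flagged:
        failures.append("some flagged outliers were never reviewed")
    if after.label_agreement <= before.label_agreement:
        failures.append("label agreement did not improve")
    return failures
```

An empty list is the only passing result, and a non-empty one tells you which symptom from the checklist to chase.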

Figure: Three-stage training data quality pipeline showing curation feeding labeling, labeling feeding auditing, and the audit looping back to fix labels. The pipeline is a loop: the audit stage feeds suspect labels back to curation and labeling.

Common Pitfalls

What You Did	Why AI Failed	The Fix
One “clean my data” script	Curation, labeling, and auditing have different inputs — one function picks the wrong priority	Decompose into three stages with explicit contracts
No duplicate threshold specified	Tool defaults to exact-match, near-duplicates leak between train and test	Define a similarity threshold in Step 2
Deleted every flagged outlier	Rare-but-valid edge cases get thrown out, model fails on them in production	Route outliers to review, not to the trash
Audited before curating	Spent labeling and audit time on redundant samples	Curate first — never audit data you’d have discarded
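
The "route outliers to review" fix can be as small as a function that splits indices into two queues and never drops anything. The sketch below assumes you already have one distance score per sample (for example, distance to its nearest neighbors in embedding space). The z-score rule and the name route_by_distance are illustrative choices, not a method taken from Cleanlab or Lightly.

```python
from statistics import mean, pstdev


def route_by_distance(distances, tolerance):
    """Split sample indices into (keep, review) by z-score of their distance.

    Nothing is deleted: every index lands in exactly one of the two lists.
    """
    mu = mean(distances)
    sigma = pstdev(distances)
    keep, review = [], []
    for i, d in enumerate(distances):
        z = 0.0 if sigma == 0 else (d - mu) / sigma
        (review if z > tolerance else keep).append(i)
    return keep, review


# Toy scores: the last sample sits far from the rest.
DISTANCES = [1.0, 1.1, 0.9, 1.0, 5.0]
keep, review = route_by_distance(DISTANCES, tolerance=1.5)
```

The tolerance is the outlier rule from Step 2 expressed as a number, so changing it is a spec change you can track.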

Pro Tip

Treat the suspect list as a spec change, not a cleanup chore. When Cleanlab flags the same class of error repeatedly, that’s not noise to delete — it’s a labeling guideline that was never written down. Fix the guideline, regenerate the labels, and the whole class of error disappears at the source. The best data quality work edits the rules upstream, not the rows downstream.
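
Spotting "the same class of error" is a counting problem. This sketch assumes each suspect is a (sample_id, given_label, suggested_label) tuple, which is a made-up shape for illustration, and groups them by confusion pair.

```python
from collections import Counter


def systematic_confusions(suspects, min_count=3):
    """Return ((given, suggested), count) pairs seen at least min_count times."""
    pairs = Counter((given, suggested) for _, given, suggested in suspects)
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


# Hypothetical audit output.
SUSPECTS = [
    ("t-101", "refund", "billing"),
    ("t-102", "refund", "billing"),
    ("t-103", "refund", "billing"),
    ("t-104", "bug", "feature"),
]

repeated = systematic_confusions(SUSPECTS)
```

A pair that clears the cutoff is a candidate for a guideline or labeling-function fix, while pairs below it are more likely one-off mistakes to correct row by row.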

Frequently Asked Questions

Q: How to detect mislabeled data in a training set? A: Train a model and get out-of-sample predicted probabilities (for example, through cross-validation), then feed them to Cleanlab’s confident learning, which ranks examples by how likely each label is wrong. Watch the threshold: a strict cutoff misses subtle errors, a loose one floods you with false positives you’ll have to triage by hand.

Q: How to build a data cleaning and curation pipeline step by step? A: Decompose into three stages — curate (Lightly), label (Snorkel), audit (Cleanlab) — each with its own input and output contract. The detail most people miss: make it a loop. The audit stage should feed corrections back to labeling, not end the pipeline.

Q: How to use Cleanlab to find label errors and outliers? A: Use the Datalab class as your single entry point — it detects label errors, outliers, duplicates, and class imbalance across text, image, audio, and tabular data. One tip: Cleanlab is model-agnostic, so the quality of your flagged list depends on the predicted probabilities you feed it.

Q: How to curate a training dataset with active learning using Lightly? A: Lightly builds self-supervised embeddings, then uses active-learning selection to pick a diverse, informative subset instead of random sampling. Pin to Python 3.12 or lower for now, and for new projects start on LightlyStudio rather than the older LightlyOne.

Your Spec Artifact

By the end of this guide, you should have:

A three-stage pipeline map — curate, label, audit — with the input and output written for each stage
A “what clean means” constraint list — label schema, duplicate threshold, outlier tolerance, class-balance target
A validation checklist with a named failure symptom for each check, so a passing run is a measured result, not a feeling

Your Implementation Prompt

Paste this into your AI coding tool to scaffold the three pipeline stages. Fill every bracket with the values you defined in Step 2 — the prompt is built around your decomposition, not a generic template.

Build a training data quality pipeline as three independent stages. Do not
merge them into one function — each stage has its own input and output contract.

STAGE 1 — CURATION (Lightly, pin Python <=3.12):
  Input: raw sample pool at [path to raw data]
  Output: a selected subset
  Method: self-supervised embeddings + active-learning selection
  Constraint: must NOT relabel; selection only
  Duplicate policy: flag pairs above similarity threshold [your threshold]

STAGE 2 — LABELING (Snorkel OSS 0.10.0, use the maintained LabelModel API):
  Input: the curated subset
  Output: probabilistic labels
  Label schema: [your exact classes]
  Ambiguous-sample rule: [your rule for samples between two classes]
  Constraint: abstained labels must be routed to [review queue], not dropped

STAGE 3 — AUDITING (Cleanlab 2.9.0, Python >=3.10, via the Datalab class):
  Input: the labeled set + model predicted probabilities
  Output: a ranked list of suspect examples
  Detect: label errors, outliers, duplicates, class imbalance
  Outlier tolerance: [how far from the distribution before flag-for-review]
  Constraint: must NOT silently delete flagged samples

VALIDATION (run after a full pass):
  - Confirm label-error count dropped on re-run
  - Confirm zero train/test duplicate leakage
  - Confirm flagged outliers were reviewed, not deleted
  - Loop suspect list back to Stage 2 to fix labeling functions

Ship It

You now think about data quality as three jobs instead of one mess: curate what goes in, label it, audit the labels — then loop. The next time a model “gets dumber,” you’ll know to check the data before the architecture, and you’ll have a pipeline that tells you exactly which stage went sideways.

Aha Moments

MONA

What Max calls “the data is the bug” has a precise mechanism underneath it. A model trained with supervised learning treats every label as ground truth — it has no way to doubt the target it’s given. A mislabeled example doesn’t just get ignored; it actively pulls the decision boundary toward the wrong region, and confident learning works by measuring exactly that tension between what the model predicts and what the label asserts. The reason auditing comes last in his pipeline is mathematical, not stylistic: you need a model’s predicted confidences before you can rank which labels disagree with the patterns the data itself contains. Clean labels aren’t a nicety. They’re the only signal the model ever sees.

DAN

Mona’s right that clean labels are the only signal — and that’s exactly why this stopped being a research nicety and became a competitive divide. The teams pulling ahead aren’t the ones with the cleverest architectures; they’re the ones who industrialized data quality while everyone else was chasing model size. Max’s three-stage decomposition is really an operational play. Curation controls what you spend annotation budget on. Auditing controls what reaches the model. Whoever owns that loop ships reliable systems faster and cheaper than the team retraining blindly on dirty data. The shift toward data-centric work isn’t a trend you can watch from the sidelines. The advantage compounds with every pass through the loop.

ALAN

Dan frames the loop as advantage, and it is — but notice what the loop quietly decides. Every time the curation stage selects a “diverse, informative” subset and the audit stage flags an outlier for review, someone’s definition of normal is being encoded into the model’s view of the world. The rare-but-valid edge case Max warns about deleting isn’t just a technical risk; it’s often a real person or a real situation that the distribution finds inconvenient. A cleaner dataset is a more opinionated one. So when the pipeline runs smoothly and the metrics improve, who was reviewing the samples that got quietly dropped — and would they agree they were noise?

AI-assisted content, human-reviewed. Images AI-generated.