
How to Build a Training Data Quality Pipeline with Cleanlab, Snorkel, and Lightly in 2026

Figure: Training data quality pipeline: curate samples, label with weak supervision, then audit labels for errors.

Before you dive in

This article is a specific deep-dive within our broader topic of Training Data Quality.

TL;DR

  • A data quality pipeline is three separate jobs — curate what goes in, label it, and audit the labels. Build them as distinct stages, not one script.
  • Define what “a label error” means for your project before you run any tool. The tools detect what you specify, not what you hope.
  • Treat the audit as a measurement, not a vibe. You should be able to show that label quality went up after a pass, with numbers you can point at.

You fine-tune a classifier. Accuracy looks fine in your notebook. You ship it. Two weeks later support tickets pile up because the model confidently labels the wrong thing — and when you trace it back, the same mistake is sitting in your training set, repeated hundreds of times. Nobody audited the labels. The model didn’t fail. It learned exactly what you gave it.

That’s the pattern I see most. The data is the bug, and there’s no stack trace for data.

Before You Start

You’ll need:

  • An AI coding tool you already use (Claude Code, Cursor, or Codex) to scaffold the pipeline stages
  • A working understanding of Training Data Quality and where Label Noise comes from
  • A clear picture of one dataset you actually want to clean — not a hypothetical one

This guide teaches you: how to decompose data quality into three independent stages — curation, labeling, and auditing — so each tool does one job well instead of one script trying to do everything.

The Model That Learned Your Mistakes

Most “the model is dumb” reports are really “the labels were wrong” reports. A handful of mislabeled examples in a small training set bends the decision boundary, and the model dutifully reproduces your error at inference time. This is the whole premise of Data-Centric AI: you get more out of fixing the data than out of swapping architectures.

It passed eval on Friday. On Monday the same model started misclassifying a whole category — because the new batch of labels came from a different annotator who interpreted the guideline differently, and nobody checked the labels against the old ones before training.

Step 1: Separate the Three Jobs Your Pipeline Actually Does

People reach for “a data cleaning script” and end up with one tangled file that selects samples, generates labels, and tries to catch errors all at once. When it breaks, you can’t tell which job broke. Decompose first.

Your pipeline has these parts:

  • Curation — decides which raw samples are worth keeping and labeling. This is where you control diversity, dedup, and Data Provenance. Owned by Lightly.
  • Labeling — produces labels at scale when you don’t have enough hand-labeled data. This is programmatic Weak Supervision. Owned by Snorkel.
  • Auditing — finds the labels that are wrong, the samples that are outliers, and the near-duplicates that inflate your metrics. Owned by Cleanlab.

Each stage has a different input and a different output. Curation takes raw data, returns a selected subset. Labeling takes the subset, returns labels. Auditing takes labeled data, returns a ranked list of suspect examples. Three contracts. Three failure modes you can isolate.
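
One way to make those three contracts concrete is to give each stage a typed signature. The sketch below is a minimal illustration, not any library's API: `Sample`, `LabeledSample`, `Suspect`, and `run_pipeline` are hypothetical names, and the integer `label` is a simplification (a weak-supervision stage would normally return probabilistic labels).

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Sample:
    sample_id: str
    payload: str  # raw text, a file path, or whatever your data is


@dataclass(frozen=True)
class LabeledSample:
    sample: Sample
    label: int  # simplification: a hard label instead of probabilities


@dataclass(frozen=True)
class Suspect:
    labeled: LabeledSample
    score: float  # assumed convention: higher means more likely mislabeled


# Each stage is a function with a fixed input type and output type.
Curate = Callable[[Sequence[Sample]], list[Sample]]
Label = Callable[[Sequence[Sample]], list[LabeledSample]]
Audit = Callable[[Sequence[LabeledSample]], list[Suspect]]


def run_pipeline(pool: Sequence[Sample], curate: Curate, label: Label, audit: Audit):
    """Run the stages in order and return every intermediate result."""
    subset = curate(pool)      # raw pool -> selected subset
    labeled = label(subset)    # selected subset -> labels
    suspects = audit(labeled)  # labeled set -> ranked suspect list
    # The audit only reports suspects; nothing is removed from `labeled` here.
    return subset, labeled, suspects
```

Because every intermediate result is returned, a failure can be traced to the stage whose output first looks wrong.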

The Architect’s Rule: If you can’t explain the pipeline in three layers, the AI can’t build it and you can’t debug it.

Step 2: Define What “Clean” Means Before You Touch the Data

Here’s the step everyone skips. You can’t detect a label error until you’ve written down what a correct label is. The tools are precise. Your definition has to be too.

Context checklist — specify these before the first run:

  • Label schema and edge cases — the exact classes, and the rule for ambiguous samples that sit between two of them
  • Duplicate policy — what counts as a duplicate. Exact match? Near-match above a similarity threshold? This drives your Data Deduplication rule
  • Outlier tolerance — how far from the distribution a sample can sit before it’s flagged for review instead of trained on
  • Class balance target — whether you’re correcting for Class Imbalance or preserving the real-world ratio
  • Stack and versions — pin them, because these tools have real version constraints (see the note below)
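
The checklist above can be written down as a small, validated object so that a missing decision fails loudly instead of being defaulted. This is a sketch under assumptions: `CleanSpec` and its field names are hypothetical, the class names in the example are invented, and the meaning of `outlier_tolerance` depends on whichever outlier score you end up using.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CleanSpec:
    classes: tuple[str, ...]     # exact label schema
    ambiguous_rule: str          # what happens to samples between two classes
    duplicate_threshold: float   # similarity in (0, 1]; 1.0 means exact match only
    outlier_tolerance: float     # hypothetical cutoff for routing to review
    rebalance_classes: bool      # True: correct imbalance, False: keep real-world ratio
    pinned_versions: dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        # Refuse to build a spec with an undefined decision.
        if len(set(self.classes)) < 2:
            raise ValueError("need at least two distinct classes")
        if not self.ambiguous_rule.strip():
            raise ValueError("write down the rule for ambiguous samples")
        if not 0.0 < self.duplicate_threshold <= 1.0:
            raise ValueError("duplicate_threshold must be in (0, 1]")


# Example values are placeholders; the versions are the ones named in this guide.
spec = CleanSpec(
    classes=("billing", "shipping", "other"),
    ambiguous_rule="send to the review queue",
    duplicate_threshold=0.95,
    outlier_tolerance=0.9,
    rebalance_classes=False,
    pinned_versions={"cleanlab": "2.9.0", "snorkel": "0.10.0", "lightly": "1.5.24"},
)
```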

The Spec Test: If your duplicate policy doesn’t define a similarity threshold, the AI will pick one for you — usually exact-match — and your near-duplicate leakage between train and test sails straight through, inflating accuracy you don’t actually have.
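
To see why the threshold matters, here is a minimal sketch of a train/test leakage check over embedding vectors. It is a brute-force illustration in plain Python, not how Lightly or Cleanlab detect duplicates; `cosine` and `find_leaks` are hypothetical helpers, and the embeddings are assumed to come from whatever encoder you already use.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_leaks(train, test, threshold):
    """Return (train_id, test_id, similarity) for cross-split pairs at or above threshold.

    `train` and `test` map sample ids to embedding vectors. The comparison is
    O(len(train) * len(test)), which is fine for a demo and too slow for a large pool.
    """
    leaks = []
    for train_id, train_vec in train.items():
        for test_id, test_vec in test.items():
            similarity = cosine(train_vec, test_vec)
            if similarity >= threshold:
                leaks.append((train_id, test_id, similarity))
    return sorted(leaks, key=lambda leak: -leak[2])


train = {"t1": [1.0, 0.0], "t2": [0.0, 1.0]}
test = {"s1": [0.98, 0.2], "s2": [1.0, 1.0]}

near_match = find_leaks(train, test, threshold=0.95)  # catches the t1/s1 pair
exact_only = find_leaks(train, test, threshold=1.0)   # the same pair slips through
```

With the exact-match setting the near-duplicate pair is never reported, which is the leakage the Spec Test warns about.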

Version & compatibility notes:

  • Cleanlab: Use 2.9.0 (released January 13, 2026), which requires Python 3.10 or newer (cleanlab PyPI).
  • Snorkel (open-source): Latest is 0.10.0 (February 2024), Python 3.11+. The open-source package is in low-maintenance mode — the team’s active development moved to the commercial Snorkel Flow platform (snorkel PyPI). Expect slow updates; pin your version and don’t wait on new OSS features.
  • Lightly: Use 1.5.24 (released May 2026). Pin your project to Python 3.12 or lower — Python 3.13 is not yet supported (lightly PyPI). Note that LightlyOne is being superseded by LightlyStudio for new curation work.
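
Taken together, the notes above leave a narrow window of Python versions that all three tools accept. The sketch below encodes only the ranges stated in this guide; Lightly's lower bound is not given here, so it is left open as an assumption, and `PYTHON_RANGES`, `unsupported_tools`, and `shared_versions` are hypothetical names.

```python
import sys

# (minimum, maximum) as (major, minor) tuples, inclusive. None means this guide
# states no bound on that side, not that the package has none.
PYTHON_RANGES = {
    "cleanlab": ((3, 10), None),
    "snorkel": ((3, 11), None),
    "lightly": (None, (3, 12)),
}


def unsupported_tools(version):
    """Return the tools whose stated Python range excludes `version`."""
    excluded = []
    for tool, (low, high) in PYTHON_RANGES.items():
        if low is not None and version < low:
            excluded.append(tool)
        elif high is not None and version > high:
            excluded.append(tool)
    return excluded


def shared_versions(candidates):
    """Return the candidate versions that every tool accepts."""
    return [version for version in candidates if not unsupported_tools(version)]


candidates = [(3, 9), (3, 10), (3, 11), (3, 12), (3, 13)]
compatible = shared_versions(candidates)  # [(3, 11), (3, 12)]

# Check the interpreter actually running the pipeline.
current_problems = unsupported_tools(tuple(sys.version_info[:2]))
```

Under these stated ranges, a single environment for all three stages has to be Python 3.11 or 3.12. Running each stage in its own environment is the alternative if that window is too tight.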

Step 3: Sequence the Pipeline — Curate, Label, Audit

Order matters, because each stage feeds the next. Get it wrong and you’ll spend labeling budget on samples you should have thrown away.

Build order:

  1. Curate first, with Lightly — because labeling and auditing both cost time you don’t want to spend on redundant or low-value samples. Lightly’s approach uses self-supervised embeddings plus active-learning selection to pick a diverse, informative subset from a large pool, which is its core curation method (Lightly Blog). For vision datasets especially, this is where you kill redundancy before it costs you anything downstream.
  2. Label next, with Snorkel — because once you’ve chosen what to keep, you need labels for it. Snorkel’s mechanism is labeling functions: small programmatic heuristics that each vote a label or abstain, combined into probabilistic labels by its LabelModel (snorkel PyPI). Use the maintained LabelModel API — older MeTaL and DryBell names are research artifacts, not current packages.
  3. Audit last, with Cleanlab, because auditing only makes sense once labels exist. Cleanlab uses Confident Learning to find label errors, and its Datalab class adds checks for outliers, duplicates, and class imbalance, serving as the single entry point across text, image, audio, and tabular data (Cleanlab Docs). It’s model-agnostic: it works from the predicted probabilities of whatever model you already have, ideally computed out-of-sample so the model is not judging labels it has memorized.
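
To build intuition for what the audit in item 3 does with predicted probabilities, here is a simplified sketch in the spirit of Confident Learning. It is not Cleanlab's implementation and does not call Cleanlab; `per_class_thresholds` and `rank_suspects` are hypothetical helpers, and the probabilities are invented. The idea it illustrates: set a per-class confidence threshold from the data, then flag examples that confidently look like a class other than their given label.

```python
def per_class_thresholds(labels, pred_probs, n_classes):
    """Threshold for class k: mean predicted probability of k among examples labeled k."""
    thresholds = []
    for k in range(n_classes):
        own = [probs[k] for y, probs in zip(labels, pred_probs) if y == k]
        # A class with no examples gets a threshold no prediction can beat by accident.
        thresholds.append(sum(own) / len(own) if own else 1.0)
    return thresholds


def rank_suspects(labels, pred_probs):
    """Return (index, given_label, suggested_label, self_confidence), most suspicious first."""
    n_classes = len(pred_probs[0])
    thresholds = per_class_thresholds(labels, pred_probs, n_classes)
    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if not confident:
            continue
        suggested = max(confident, key=lambda k: probs[k])
        if suggested != y:
            suspects.append((i, y, suggested, probs[y]))
    # Lowest probability on the given label first.
    return sorted(suspects, key=lambda s: s[3])


# Invented probabilities; in practice these should be out-of-sample predictions.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.7, 0.3]]
suspects = rank_suspects(labels, pred_probs)
```

Here examples 2 and 5 are flagged, with example 2 ranked first because the model gives its current label the lowest probability. The output is a ranked list for review, not a list of deletions.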

For each stage, your context must specify:

  • What it receives (raw pool / selected subset / labeled set)
  • What it returns (subset / labels / ranked suspect list)
  • What it must NOT do (curation must not relabel; auditing must not silently drop samples)
  • How to handle failure (what happens to abstained labels, to flagged outliers, to ties)
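
The failure-handling line is the one most often left vague, so here is a sketch of what "abstained labels and ties go to review" can look like. It uses a plain majority vote as a stand-in for Snorkel's LabelModel (which estimates how much to trust each labeling function rather than counting votes equally). The labeling functions, the spam/ham task, and all names are hypothetical; `ABSTAIN = -1` follows the convention used in Snorkel's tutorials.

```python
from collections import Counter

ABSTAIN = -1
SPAM, HAM = 1, 0


# Each labeling function votes for a class or abstains.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_short_reply(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN


def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_link, lf_short_reply, lf_mentions_prize]


def vote(text):
    """Return (label, reason); label is None when the sample needs human review."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return None, "all_abstained"
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None, "tie"
    return ranked[0][0], "majority"


def label_or_route(texts):
    """Split texts into labeled samples and a review queue; nothing is dropped."""
    labeled, review_queue = [], []
    for text in texts:
        label, reason = vote(text)
        if label is None:
            review_queue.append((text, reason))
        else:
            labeled.append((text, label))
    return labeled, review_queue
```

Every input ends up in exactly one of the two lists, and the review queue records why each sample was routed there, which is what you need when deciding which labeling function to rewrite.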

The output of stage three usually loops back: the suspect list from Cleanlab tells you which labels to fix or which labeling functions to rewrite. A data quality pipeline is a loop, not a one-shot.

Step 4: Prove the Data Got Better

“It looks cleaner” is not a result. You specified what clean means in Step 2 — now measure against it.

Validation checklist:

  • Label-error count dropped. Failure looks like: you fixed flagged labels but the re-run surfaces the same count, which suggests your fixes either missed the real problem or introduced new errors
  • Train/test duplicate leakage is zero. Failure looks like: suspiciously high eval accuracy that collapses on a truly held-out set
  • Outlier set was reviewed, not deleted. Failure looks like: your rare-but-valid edge cases vanished, and the model now fails on exactly those in production
  • Label agreement improved. Failure looks like: the LabelModel’s confidence stayed flat, which suggests your labeling functions still conflict as often as they did before

Figure: Three-stage training data quality pipeline. The pipeline is a loop: the audit stage feeds suspect labels back to curation and labeling.
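
The validation checklist can be turned into a check that compares metrics from before and after a pass. This is a minimal sketch: `PassMetrics` and `validate_pass` are hypothetical, the numbers are invented, and how you compute each metric depends on your own tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassMetrics:
    label_error_count: int        # suspects reported by the audit stage
    train_test_leaks: int         # cross-split duplicate pairs above your threshold
    outliers_flagged: int
    outliers_reviewed: int
    mean_label_confidence: float  # e.g. average max probability from the labeling stage


def validate_pass(before, after):
    """Return a list of named failures; an empty list means the pass measurably helped."""
    failures = []
    if after.label_error_count >= before.label_error_count:
        failures.append("label-error count did not drop")
    if after.train_test_leaks != 0:
        failures.append("train/test duplicate leakage is not zero")
    if after.outliers_reviewed < after.outliers_flagged:
        failures.append("some flagged outliers were not reviewed")
    if after.mean_label_confidence <= before.mean_label_confidence:
        failures.append("label agreement did not improve")
    return failures


# Invented numbers for illustration only.
before = PassMetrics(120, 4, 30, 0, 0.71)
after = PassMetrics(45, 0, 30, 30, 0.83)
result = validate_pass(before, after)
```

Returning named failures rather than a single boolean keeps each check tied to its symptom, so a failed pass points at a stage to revisit.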

Common Pitfalls

| What You Did | Why AI Failed | The Fix |
| --- | --- | --- |
| One “clean my data” script | Curation, labeling, and auditing have different inputs, so one function picks the wrong priority | Decompose into three stages with explicit contracts |
| No duplicate threshold specified | Tool defaults to exact-match, so near-duplicates leak between train and test | Define a similarity threshold in Step 2 |
| Deleted every flagged outlier | Rare-but-valid edge cases get thrown out, and the model fails on them in production | Route outliers to review, not to the trash |
| Audited before curating | Spent labeling and audit time on redundant samples | Curate first; never audit data you’d have discarded |

Pro Tip

Treat the suspect list as a spec change, not a cleanup chore. When Cleanlab flags the same class of error repeatedly, that’s not noise to delete — it’s a labeling guideline that was never written down. Fix the guideline, regenerate the labels, and the whole class of error disappears at the source. The best data quality work edits the rules upstream, not the rows downstream.
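
One way to spot a missing guideline is to count which (given label, suggested label) pairs recur in the suspect list. The sketch below assumes the suspect list has already been exported as tuples; `systematic_confusions`, the `min_count` cutoff, and the sample data are hypothetical.

```python
from collections import Counter


def systematic_confusions(suspects, min_count=3):
    """Return ((given, suggested), count) pairs that recur at least `min_count` times.

    `suspects` is an iterable of (sample_id, given_label, suggested_label).
    """
    pairs = Counter((given, suggested) for _, given, suggested in suspects)
    return [(pair, n) for pair, n in pairs.most_common() if n >= min_count]


# Invented suspect list: class 0 keeps being flagged as class 1.
suspects = [("a", 0, 1), ("b", 0, 1), ("c", 0, 1), ("d", 2, 1), ("e", 1, 0)]
recurring = systematic_confusions(suspects)
```

A pair that clears the cutoff is a candidate for a guideline or labeling-function fix; pairs that appear once or twice are more likely individual mistakes to correct row by row.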

Frequently Asked Questions

Q: How to detect mislabeled data in a training set? A: Train a model and collect out-of-sample predicted probabilities (for example, through cross-validation), then feed them to Cleanlab’s confident learning, which ranks examples by how likely each label is wrong. Watch the threshold: a strict cutoff misses subtle errors, while a loose one floods you with false positives you’ll have to triage by hand.

Q: How to build a data cleaning and curation pipeline step by step? A: Decompose into three stages — curate (Lightly), label (Snorkel), audit (Cleanlab) — each with its own input and output contract. The detail most people miss: make it a loop. The audit stage should feed corrections back to labeling, not end the pipeline.

Q: How to use Cleanlab to find label errors and outliers? A: Use the Datalab class as your single entry point — it detects label errors, outliers, duplicates, and class imbalance across text, image, audio, and tabular data. One tip: Cleanlab is model-agnostic, so the quality of your flagged list depends on the predicted probabilities you feed it.

Q: How to curate a training dataset with active learning using Lightly? A: Lightly builds self-supervised embeddings, then uses active-learning selection to pick a diverse, informative subset instead of random sampling. Pin to Python 3.12 or lower for now, and for new projects start on LightlyStudio rather than the older LightlyOne.

Your Spec Artifact

By the end of this guide, you should have:

  • A three-stage pipeline map — curate, label, audit — with the input and output written for each stage
  • A “what clean means” constraint list — label schema, duplicate threshold, outlier tolerance, class-balance target
  • A validation checklist with a named failure symptom for each check, so a passing run is a measured result, not a feeling

Your Implementation Prompt

Paste this into your AI coding tool to scaffold the three pipeline stages. Fill every bracket with the values you defined in Step 2 — the prompt is built around your decomposition, not a generic template.

Build a training data quality pipeline as three independent stages. Do not
merge them into one function — each stage has its own input and output contract.

STAGE 1 — CURATION (Lightly, pin Python <=3.12):
  Input: raw sample pool at [path to raw data]
  Output: a selected subset
  Method: self-supervised embeddings + active-learning selection
  Constraint: must NOT relabel; selection only
  Duplicate policy: flag pairs above similarity threshold [your threshold]

STAGE 2 — LABELING (Snorkel OSS 0.10.0, use the maintained LabelModel API):
  Input: the curated subset
  Output: probabilistic labels
  Label schema: [your exact classes]
  Ambiguous-sample rule: [your rule for samples between two classes]
  Constraint: abstained labels must be routed to [review queue], not dropped

STAGE 3 — AUDITING (Cleanlab 2.9.0, Python >=3.10, via the Datalab class):
  Input: the labeled set + model predicted probabilities
  Output: a ranked list of suspect examples
  Detect: label errors, outliers, duplicates, class imbalance
  Outlier tolerance: [how far from the distribution before flag-for-review]
  Constraint: must NOT silently delete flagged samples

VALIDATION (run after a full pass):
  - Confirm label-error count dropped on re-run
  - Confirm zero train/test duplicate leakage
  - Confirm flagged outliers were reviewed, not deleted
  - Loop suspect list back to Stage 2 to fix labeling functions

Ship It

You now think about data quality as three jobs instead of one mess: curate what goes in, label it, audit the labels — then loop. The next time a model “gets dumber,” you’ll know to check the data before the architecture, and you’ll have a pipeline that tells you exactly which stage went sideways.

AI-assisted content, human-reviewed. Images AI-generated.