MAX guide 11 min read June 6, 2026 Updated July 8, 2026

Building a Data Preprocessing Pipeline with scikit-learn, pandas, and Feature-engine in 2026

[Image: a data preprocessing pipeline routing numeric and categorical columns through a scikit-learn ColumnTransformer]

TL;DR

Fit every transformer on training data only. The split comes first — always.
One branch per data type. Numeric columns and categorical columns travel separate paths through a single ColumnTransformer.
Wrap the whole thing in a Pipeline object. The spec becomes the code that runs, and the leak has nowhere to hide.

Your model scored 0.94 on validation. In production it dropped to 0.71. Nobody touched the model. The architecture is identical. The only thing that changed is that the test set is no longer secretly informing the statistics your preprocessing learned, because in production there is no test set to leak from. The preprocessing was the bug. It always is.

Before You Start

You’ll need:

An AI coding tool — Claude Code, Cursor, or Codex — to generate the implementation
A working grasp of Data Preprocessing and why a Train Test Split exists
A clear picture of your dataset: which columns are numbers, which are categories, which leak the target

This guide teaches you: how to decompose preprocessing into independent, order-aware components so your AI tool builds a pipeline that fits on training data only — and proves it.

The 0.94 That Became 0.71

Here is the most common failure I see. Someone loads the full dataset, scales every numeric column, imputes the missing values, encodes the categories — and then splits into train and test. The scaler computed its mean from rows that include the test set. The imputer learned its fill values from data the model is supposed to have never seen. That is Data Leakage, and it inflates every validation score you look at.

It worked on Friday because validation looked great. On Monday it broke, because production data arrives one row at a time with no test set to borrow statistics from, and the model was never actually trained for that world.
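
To make the leak concrete, here is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions for illustration, not taken from a real project). It fits one scaler before the split and one after, then compares what each one learned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: the mean and variance are computed from all 200 rows,
# test rows included.
leaky_scaler = StandardScaler().fit(X)

# Correct: the scaler only ever sees the 150 training rows.
clean_scaler = StandardScaler().fit(X_train)

print("rows seen by leaky scaler:", leaky_scaler.n_samples_seen_)
print("rows seen by clean scaler:", clean_scaler.n_samples_seen_)
print("mean learned from all rows:  ", leaky_scaler.mean_[0])
print("mean learned from train rows:", clean_scaler.mean_[0])
```

The two means differ only slightly on a toy example like this, which is exactly why the leak goes unnoticed: nothing crashes, and the numbers merely come out a little too flattering.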

Step 1: Map the Preprocessing as a Graph, Not a Script

Stop thinking of preprocessing as a sequence of lines you run top to bottom. Think of it as a graph with one hard boundary and two parallel branches. If you can draw it, the AI can build it.

Your system has these parts:

The split boundary: train_test_split runs first and acts as a firewall. Nothing fitted may see across it.
The numeric branch: Missing Data Imputation followed by Feature Scaling (Standardization or Normalization, depending on your model).
The categorical branch: imputation followed by Categorical Encoding, usually One Hot Encoding for low-cardinality columns.
The orchestrator: a ColumnTransformer that routes each column to the right branch, wrapped in a Pipeline that chains preprocessing into the estimator.

The Architect’s Rule: If you can’t explain the system in three layers — boundary, branches, orchestrator — the AI can’t build it either.
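
As a rough sketch, the branches and the orchestrator from that map could look like this in scikit-learn. The column names are hypothetical placeholders, and the median and most-frequent strategies are just one reasonable choice. Nothing is fitted here; this only declares the structure.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names; replace with your own.
NUMERIC_COLS = ["age", "income"]
CATEGORICAL_COLS = ["city", "plan"]

# Numeric branch: impute first, then scale.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: impute first, then one-hot encode.
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Orchestrator: routes each named column to its branch and drops the rest.
preprocessor = ColumnTransformer(
    transformers=[
        ("numeric", numeric_branch, NUMERIC_COLS),
        ("categorical", categorical_branch, CATEGORICAL_COLS),
    ],
    remainder="drop",
)
```

The split boundary does not appear in this object on purpose. It is enforced by what you later pass to fit, not by anything inside the transformer.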

Step 2: Lock Down the Contract

The AI defaults to whatever its training data saw most often. That means fit-on-everything, because that is what most tutorials show. You have to specify the firewall explicitly, or it disappears.

Context checklist:

Library versions pinned. scikit-learn 1.9.0 supports Python 3.11 through 3.14 and added narwhals for dataframe interoperability (scikit-learn Docs). Pandas 3.0.3 makes PyArrow a required dependency and turns on Copy-on-Write by default (pandas Blog).
Column types declared — which columns are numeric, which are categorical, by name not by guess.
Imputation strategy chosen per branch — median for skewed numerics, most-frequent or a constant sentinel for categories.
Encoding strategy chosen — and a rule for unseen categories at inference time (handle_unknown="ignore" is not optional in production).
The fit contract stated in one sentence: every transformer is fitted inside cross-validation folds, on training rows only.
Output type decided — Feature-engine 1.9.4 returns pandas DataFrames with named features through standard fit()/transform() methods (Feature-engine Docs), which keeps your feature names alive downstream instead of collapsing to an anonymous array.

The Spec Test: If your context doesn’t say “fit only on the training fold,” the AI will fit on the full frame, your cross-validation scores will look better than reality, and you will not find out until production.
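
One way to honor the fit contract, sketched on synthetic numeric data (the data, the LogisticRegression estimator, and the 5-fold setting are all assumptions for illustration): hand the whole Pipeline to cross_val_score, which clones it for each fold and fits every step on that fold's training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.05] = np.nan  # knock out roughly 5% of values

# The firewall: the test rows are set aside before anything is fitted.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Each fold refits imputer, scaler, and classifier on that fold's training rows.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Final fit uses the training split only; the test rows are scored once.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)

print("cross-validated accuracy:", cv_scores.mean())
print("held-out test accuracy:  ", test_score)
```

If the cross-validated number and the held-out number land close together, the contract is probably holding. A large gap in favor of cross-validation is the symptom to chase.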

Step 3: Wire the Branches in the Right Order

Order is not cosmetic here. It is the difference between a leak and a firewall. Build it in the only sequence that holds the boundary.

Build order:

Split first: train_test_split before any transformer is touched, because every downstream fit must be blind to the test rows.
Build each branch as its own mini-pipeline: impute then transform. Scaling a column that still has NaNs either passes them straight through to the estimator (scikit-learn's StandardScaler ignores NaNs when fitting and keeps them on transform) or fails outright, so the dependency runs imputation → scaling, never the reverse.
Compose with ColumnTransformer last, because it integrates the numeric and categorical branches into one object that the estimator and cross-validation can both see.

For each component, your context must specify:

What it receives: the named subset of columns it owns
What it returns: transformed columns, with names preserved if you route through Feature-engine
What it must NOT do: never call fit on rows outside the current training fold or training split
How to handle failure: unseen categories ignored, missing values imputed, no silent row drops
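
The imputation-before-scaling dependency is easy to check on a toy column. This is a small sketch with made-up values: scaling alone leaves the missing value in place for the estimator to trip over, while the impute-then-scale branch returns a clean column.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training column with one missing value (made-up numbers).
X_train = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0]])

# Scaling alone: StandardScaler skips NaNs when fitting and
# keeps them in place on transform.
scaled_only = StandardScaler().fit_transform(X_train)

# Branch in the right order: impute, then scale.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
imputed_then_scaled = numeric_branch.fit_transform(X_train)

print("NaNs after scaling only:      ", int(np.isnan(scaled_only).sum()))
print("NaNs after impute then scale: ", int(np.isnan(imputed_then_scaled).sum()))
print("learned fill value (median):  ", numeric_branch.named_steps["impute"].statistics_[0])
```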

Step 4: Prove the Leak Is Gone

You do not trust a pipeline because it ran without error. You trust it because you checked the things that fail silently.

Validation checklist:

Fit happens inside cross-validation. Failure looks like: cross-validation accuracy noticeably higher than accuracy on a held-out test set you never touched.
Feature names survive the transform. Failure looks like: a downstream step references a column by name and throws a KeyError.
Unseen categories handled at inference. Failure looks like: a crash on the first production row containing a value not present at training time.
Outlier Detection applied before scaling, not after. Failure looks like: a standard scaler whose mean is dragged by extremes you meant to clip.

[Diagram: the train-test split is the firewall; numeric and categorical branches converge in one ColumnTransformer wrapped by a Pipeline that ends in a single estimator.]

Security & compatibility notes:
pandas 3.0 (Copy-on-Write): PyArrow is now a required dependency and Copy-on-Write is the default. Chained-assignment mutation patterns from 2.x no longer behave as before (pandas Blog). Action: refactor in-place edits to explicit .loc assignments or fresh copies before you pin 3.x in a preprocessing job.
scikit-learn 1.9.0 (Python floor): Drops support for Python below 3.11 and adds the narwhals dependency (scikit-learn Docs). Action: confirm your runtime is Python 3.11+ before upgrading.
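
For the Copy-on-Write refactor, a minimal sketch of the two safe patterns follows. The frame and the cleaning rules are made up; both patterns also work on pandas 2.x, so you can migrate before pinning 3.x.

```python
import pandas as pd

# Made-up frame with one invalid income value.
df = pd.DataFrame({"income": [30.0, -1.0, 55.0], "plan": ["a", "b", "a"]})

# A chained assignment such as
#     df[df["income"] < 0]["income"] = 0.0
# writes to a temporary object and never updates df under Copy-on-Write.

# Pattern 1: one explicit .loc assignment on the frame you mean to change.
df.loc[df["income"] < 0, "income"] = 0.0

# Pattern 2: take a fresh copy when the source must stay untouched.
cleaned = df.copy()
cleaned["plan"] = cleaned["plan"].str.upper()

print(df)
print(cleaned)
```

Neither edit learns anything from the data, so both can live upstream of the Pipeline without breaking the fit contract.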

Common Pitfalls

What You Did	Why AI Failed	The Fix
Scaled and imputed before splitting	The AI followed the most common tutorial pattern, which leaks	Specify “split first, fit transformers on training folds only”
Passed columns as one undifferentiated block	The AI guessed numeric vs. categorical and mislabeled some	Declare column types by name in your context
Skipped the unseen-category rule	The AI generated happy-path encoding only	Add `handle_unknown="ignore"` to the encoding spec
Let the transform return a raw array	The AI optimized for fewer lines, dropping feature names	Route through Feature-engine to keep named DataFrame output

Pro Tip

The Pipeline object is your specification made executable. Anything that touches the data and learns from it — a fill value, a mean, a category vocabulary — must live inside that object, not in a loose cell above it. The moment a transformation happens outside the pipeline, your cross-validation is lying to you, because that step already saw the data it is about to be tested on. Treat the Pipeline boundary as the same kind of contract you would put around any module: state goes in through one door, or it does not go in at all.

Frequently Asked Questions

Q: How to build a preprocessing pipeline with scikit-learn ColumnTransformer and Pipeline?

A: Define two branch pipelines, numeric and categorical, each ordering imputation before transformation. Route columns by name through a ColumnTransformer, then wrap that in a Pipeline ending in your estimator. The detail tutorials skip: remainder="drop" is already the ColumnTransformer default, but state it explicitly so a later switch to "passthrough" cannot let stray columns slip in unnoticed.

Q: How to handle missing values and encode categorical features for model training?

A: Impute numerics with the median, categories with most-frequent or a constant sentinel, then one-hot encode. The trap most teams hit: set handle_unknown="ignore" on the encoder, or the first unseen category in production crashes the entire inference call.

Q: When should you use pandas vs Polars for data cleaning in 2026?

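A: Use pandas when your data feeds straight into scikit-learn; it is the best-supported integration path. Reach for Polars on large upstream cleaning jobs, where benchmarks report it running several times faster. The catch: Polars support inside sklearn is narrower than pandas support (set_output(transform="polars") has existed since scikit-learn 1.4, but third-party transformers such as Feature-engine expect pandas), so converting to pandas before fitting is the safe default.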

Your Spec Artifact

By the end of this guide, you should have:

A component map — the firewall, the two branches, and the orchestrator, drawn before any code exists
A constraint list — pinned versions, column types by name, per-branch imputation and encoding rules, the unseen-category rule, and the fit-on-training-only contract
A validation criteria set — fit happens inside cross-validation, feature names survive, unseen categories are handled, outliers are clipped before scaling

Your Implementation Prompt

Paste this into your AI coding tool after replacing every bracket with your own values. It mirrors the four steps above, so what you specified is exactly what the tool builds.

Build a scikit-learn preprocessing pipeline. Do not fit any transformer
outside cross-validation.

CONTEXT (pin these):
- scikit-learn [1.9.0], pandas [3.0.x], Feature-engine [1.9.4], Python [3.11+]
- Numeric columns: [list by name]
- Categorical columns: [list by name]
- Target column: [name] — must never enter a transformer

STEP 1 — STRUCTURE:
- Run train_test_split FIRST as the firewall.
- Numeric branch: imputation([median]) -> scaling([standardization/normalization]).
- Categorical branch: imputation([most_frequent/constant]) -> encoding([one-hot],
  handle_unknown="ignore").
- Combine both branches in a ColumnTransformer with remainder="drop".

STEP 2 — CONTRACT:
- Every transformer is fitted on training folds only.
- Preserve feature names (use Feature-engine transformers where possible).
- Unseen categories at inference must not crash.

STEP 3 — ORDER:
- Impute before transform in each branch.
- Wrap the ColumnTransformer + [estimator] in a single Pipeline.

STEP 4 — VALIDATE:
- Report cross-validated scores, not a single train/test number.
- Confirm the pipeline runs on a held-out row containing an unseen category.

Ship It

You now read preprocessing as a graph with a firewall, not a script you run top to bottom. You can point at the exact line where a leak enters, and you can specify it out of existence before the AI writes a single transformer. That mental model travels to every dataset you touch next.

Aha Moments

MONA

What MAX calls a firewall is, underneath, a question about which distribution your statistics are estimating. A scaler computes a mean and a variance; an imputer learns a central value; an encoder builds a vocabulary. Each is a small model fitted to data. Let it see the test rows and you have estimated the wrong distribution — one that includes information the deployed model will never possess. The validation score then measures memorization of a sample, not generalization to a population. The discipline of fitting inside the fold is not bureaucracy. It is the only way to keep your estimate of error honest, because the estimate is only meaningful when the data it was measured on stayed genuinely unseen.

DAN

Mona frames it as estimating the right distribution. From where I sit, this is the difference between a demo that wins the meeting and a system that survives contact with customers. Inflated validation scores are a sales pitch you make to yourself, and the market collects the debt later — in churned users, in a rollback, in trust you do not get back. The teams pulling ahead right now treat preprocessing as production infrastructure, not a notebook afterthought. They standardize the pipeline once and reuse it everywhere. That is how you ship faster without shipping fragile, and speed with reliability is the only combination that compounds in your favor over a release cycle.

ALAN

Dan talks about debt the market collects. I would ask who signs for it. A leaked test set does not announce itself — the model looks excellent right up until a real person depends on its output. The specification MAX teaches is, quietly, an ethical instrument: it forces you to state what the system may and may not learn from, before anyone is affected by the answer. That honesty is cheap to install now and expensive to retrofit after a decision has already gone wrong. So here is the question worth sitting with: if your pipeline could not see the difference between knowing and pretending to know, how would you ever find out before your users did?

AI-assisted content, human-reviewed. Images AI-generated.