Data Preprocessing

6 articles · 59 min total read · Updated Jul 8, 2026



Every later stage of a machine learning pipeline inherits whatever preprocessing decided to keep, rescale, or discard. A column scaled on the wrong slice of data doesn’t throw an error; it just quietly caps how well the downstream model can learn. That is why preprocessing anchors the training data quality and curation theme as one of its two foundational layers, and why the order you do things in matters as much as the transforms themselves.

Split before you transform: fitting a scaler or encoder on the full dataset before separating train and test data leaks test information into training and inflates the accuracy score you trust.
Numeric and categorical columns need separate treatment — scaling one, encoding the other — routed through their own branch of a single pipeline object, not one undifferentiated block.
The tooling question is largely settled: pandas, Polars, and GPU engines now interoperate through the Apache Arrow format, so the framework you pick matters less than whether the pipeline is leakage-safe.
Every rule for what counts as “dirty” data is also a judgment about whose records get dropped — that decision deserves as much scrutiny as the transform itself.
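The first two points above can be sketched in a few lines of scikit-learn. This is a minimal illustration, not the design from the linked guide: the column names, the toy values, and the choice of LogisticRegression are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: column names and values are illustrative only.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63, 38, 44],
    "income": [28_000, 52_000, 71_000, 64_000, 39_000, 83_000, 58_000, 61_000],
    "plan": ["basic", "pro", "pro", "basic", "basic", "pro", "basic", "pro"],
    "churned": [1, 0, 0, 1, 1, 0, 1, 0],
})
X = df.drop(columns="churned")
y = df["churned"]

# 1. Split first, so test rows never influence any fitted statistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Route numeric and categorical columns through separate branches.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# 3. Wrap preprocessing and the model in a single Pipeline object.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

# fit() learns scaler means and encoder categories from X_train only.
model.fit(X_train, y_train)
predictions = model.predict(X_test)
```

Because the scaler and encoder live inside the pipeline, calling `model.predict(X_test)` reuses the statistics learned from the training rows rather than refitting on test data.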

Reading this topic in the order that prevents leakage

Start with how cleaning, scaling, and encoding turn raw data into training sets; it supplies the vocabulary the rest of this topic assumes. Cleaning, scaling, and encoding are three separate jobs, and most of the confusion below starts from treating them as one. Read "Before You Preprocess" in the same sitting: it settles the type, distribution, and train-test split questions you must answer before any transform runs, because getting the split order wrong is the most common way a pipeline quietly breaks a model.

That ordering mistake gets its own deep dive in data leakage and the technical limits of preprocessing pipelines — read it once you understand why the split has to come first. When you are ready to build, the scikit-learn, pandas, and Feature-engine pipeline guide turns the leakage rule into a concrete ColumnTransformer-plus-Pipeline design. For the tooling context behind that build, the pandas vs Polars and GPU preprocessing read covers where the underlying engines are converging. Close with whose data gets cleaned away — before a cleaning rule ships at scale, it is worth knowing whose records it quietly discards.

Comic dialog. MAX asks: 'My pipeline runs clean and every test passes, so why is production accuracy still lower than validation?' MONA answers: 'Because the scaler was probably fit before the split. Leaked information never announces itself; it just leaks.' Caption: A pipeline can run without a single error and still be leaking information.

How preprocessing differs from augmenting and deduplicating data

Two neighbouring practices get folded into “preprocessing” when they are really pointed in different directions.

Preprocessing does not add anything to a dataset — it reshapes the values already there so a model can consume them, without changing what any example means. Data augmentation does the opposite: it deliberately creates new, altered examples from existing ones to grow the dataset. Confusing the two leads teams to expect a scaling step to fix a data-scarcity problem it was never built to solve.
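A rough sketch of the difference, using NumPy only. The feature values are invented, and additive noise is just one hypothetical augmentation strategy chosen for illustration; the point is that scaling leaves the row count unchanged while augmentation grows it.

```python
import numpy as np

# Hypothetical feature matrix: three examples, two columns.
features = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Preprocessing: standardize each column. Same three examples, new scale.
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Augmentation (assumed strategy: small random jitter): new examples are
# created from the existing ones and appended to the dataset.
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.01 * features.std(axis=0), size=features.shape)
augmented = np.vstack([features, features + noise])

print(scaled.shape)     # row count unchanged
print(augmented.shape)  # row count doubled
```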

Deduplication runs on a different scope entirely. Data deduplication removes repeated or near-repeated records from the raw corpus before any single pipeline sees it; preprocessing then transforms whatever records survive that pass. Running deduplication after preprocessing, on already-scaled and encoded data, means re-deriving the raw text or values dedup actually needs to compare — the order is not interchangeable.
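The ordering can be illustrated with a small pandas sketch. The records and the matching rule (trimmed, lower-cased text plus rating) are assumptions for the example; real near-duplicate detection is usually more involved. It shows duplicates being removed from the raw values first, and how they would otherwise skew the statistics a later scaling step learns.

```python
import pandas as pd

# Hypothetical raw records; rows 0, 1, and 3 are the same review repeated.
raw = pd.DataFrame({
    "text": ["Great product", "great product ", "Arrived late", "Great product"],
    "rating": [5, 5, 2, 5],
})

# Build a comparison key from the raw text (assumed rule: trim and lower-case).
match_key = raw["text"].str.strip().str.lower()
is_duplicate = raw.assign(match_key=match_key).duplicated(
    subset=["match_key", "rating"]
)

# Deduplicate on raw values, before any scaling or encoding runs.
deduplicated = raw.loc[~is_duplicate].reset_index(drop=True)

# Statistics a scaler would learn differ once the repeats are gone.
mean_with_duplicates = raw["rating"].mean()
mean_after_dedup = deduplicated["rating"].mean()
```

Once the text column has been encoded or the ratings rescaled, a comparison key like `match_key` can no longer be built without going back to the raw values.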

Common questions about data preprocessing

Q: Why does a preprocessing pipeline that passes every test still produce a model that underperforms in production? A: The usual cause is leakage introduced before testing ever runs — a scaler or encoder fit on the full dataset rather than just the training split, so test-set statistics quietly shaped the training data. Data leakage and the technical limits of preprocessing pipelines traces exactly where that mistake enters.
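To make the mechanism concrete, the sketch below hand-computes the mean and standard deviation a standard scaler would learn, once from the full dataset and once from the training split alone. The data is synthetic, and the deliberately shifted test split is an assumption chosen to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic feature: the test rows come from a shifted distribution.
train = rng.normal(loc=50.0, scale=5.0, size=80)
test = rng.normal(loc=70.0, scale=5.0, size=20)
full = np.concatenate([train, test])

# Leaky: statistics computed before the split include the test rows.
leaky_mean, leaky_std = full.mean(), full.std()

# Safe: statistics computed from the training split only.
safe_mean, safe_std = train.mean(), train.std()

train_scaled_leaky = (train - leaky_mean) / leaky_std
train_scaled_safe = (train - safe_mean) / safe_std

# The leaky version of the training data is no longer centred on zero,
# because test-set values pulled the learned mean upward.
print(round(train_scaled_safe.mean(), 6), round(train_scaled_leaky.mean(), 6))
```

Neither version raises an error, which is why this kind of leak tends to survive a test suite that only checks that the pipeline runs.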

Q: Is it worth adding unseen-category handling to an encoder that already works in testing? A: Yes — production data eventually contains a category the training set never had, and an encoder without an explicit rule for it will error or silently corrupt the row. The scikit-learn, pandas, and Feature-engine pipeline guide sets that handling as a default part of the spec, not an edge case.
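A minimal sketch of what an explicit rule looks like with scikit-learn's OneHotEncoder. The column name and category values are hypothetical, and `handle_unknown="ignore"` is only one possible policy (it encodes an unseen category as all zeros); the linked guide may specify a different one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "enterprise" appears in production but not in training.
train = pd.DataFrame({"plan": ["basic", "pro", "basic"]})
production = pd.DataFrame({"plan": ["pro", "enterprise"]})

# Default behaviour: an unseen category raises at transform time.
strict = OneHotEncoder()
strict.fit(train)
try:
    strict.transform(production)
    strict_failed = False
except ValueError:
    strict_failed = True

# Explicit rule: unseen categories become a row of zeros instead of an error.
tolerant = OneHotEncoder(handle_unknown="ignore")
tolerant.fit(train)
encoded = tolerant.transform(production).toarray()

print(strict_failed)
print(encoded)
```

An all-zero row is itself a modelling decision, so it is worth logging how often it occurs in production rather than treating it as solved.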

Q: Does moving from pandas to Polars mean rewriting an existing preprocessing pipeline? A: Increasingly, no — data-prep tooling is consolidating on the Apache Arrow columnar format, so pandas, Polars, and GPU engines move data between each other with less friction than a full rewrite implies. The pandas vs Polars tooling read covers what is converging and what still is not.

Q: Is dropping rows with missing or messy values a purely technical decision? A: Not entirely — every rule for what counts as bad data also decides whose records get kept, and that judgment is rarely written down or reviewed. Whose data gets cleaned away follows that accountability gap into a real pipeline.
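One way to put that judgment in writing is to measure, per group, how many rows a cleaning rule removes before it ships. The sketch below uses invented records with a hypothetical `group` column; the numbers illustrate the check and are not evidence about any real dataset.

```python
import pandas as pd

# Invented records: group B has more missing income values than group A.
records = pd.DataFrame({
    "group": ["A"] * 6 + ["B"] * 4,
    "income": [52, 61, 48, None, 57, 63, None, None, 45, None],
})

# The cleaning rule under review: drop any row with a missing income.
kept = records.dropna(subset=["income"])

rows_before = records.groupby("group").size()
rows_after = kept.groupby("group").size().reindex(rows_before.index, fill_value=0)

# Share of each group's rows that the rule discards.
drop_rate = 1 - rows_after / rows_before

# Share of the dataset each group represents, before and after cleaning.
share_before = rows_before / rows_before.sum()
share_after = rows_after / rows_after.sum()

print(drop_rate)
print(share_before["B"], share_after["B"])
```

A large gap between groups in `drop_rate` does not by itself say the rule is wrong, but it flags a decision that should be reviewed rather than applied silently.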


Understand the Fundamentals

Data preprocessing sits between raw data and a working model, and much of its impact stays invisible. Understanding what each transformation actually does to your data separates a reliable pipeline from a fragile one.

Concepts covered

Raw spreadsheet rows transforming into clean, scaled, and encoded numeric feature columns prepared for model training

MONA explainer Start here 10 min Jun 6, 2026

What Is Data Preprocessing and How Cleaning, Scaling, and Encoding Turn Raw Data into Training Sets

Data preprocessing cleans, scales, and encodes raw data into model-ready features. Fitting transformers before the train-test split causes data leakage.

Diagram showing why splitting data before preprocessing keeps test-set statistics out of the model's learned transforms.

MONA explainer Start here 10 min Jun 6, 2026

Before You Preprocess: Data Types, Distributions, and Train-Test Splits You Need to Understand First

Split data into train and test sets before preprocessing to prevent data leakage. Fitting scalers on the full dataset inflates accuracy and fails in production.

Diagram of how data leakage inflates validation accuracy when preprocessing runs before the train-test split

MONA explainer Start here 10 min Jun 6, 2026

Data Leakage, Lost Information, and the Technical Limits of Preprocessing Pipelines

Data leakage occurs when information unavailable at prediction time enters training, inflating validation accuracy while production performance collapses.

Build with Data Preprocessing

These guides walk through assembling a preprocessing pipeline you can maintain—where to clean, how to scale and encode, and which trade-offs keep the same transformations consistent between training and production.

Tools & techniques

Data preprocessing pipeline routing numeric and categorical columns through a scikit-learn ColumnTransformer to prevent

MAX guide Start here 11 min Jun 6, 2026

Building a Data Preprocessing Pipeline with scikit-learn, pandas, and Feature-engine in 2026

scikit-learn pipelines stop data leakage by fitting transformers on training data only. ColumnTransformer routes numeric and categorical columns separately.

What's Changing in 2026

Preprocessing tooling is shifting quickly, and the choices you make today shape how well your pipelines scale tomorrow. Staying current on what is gaining ground helps you avoid betting on the wrong stack.

Models & benchmarks

Updated June 2026

pandas, Polars, and GPU preprocessing engines converging on the Apache Arrow columnar data standard

DAN Analysis Start here 9 min Jun 6, 2026

pandas vs Polars and the Rise of GPU Preprocessing: Where Data Prep Tooling Is Heading in 2026

Data-prep tooling consolidates on Apache Arrow in 2026: pandas 3.0, Polars, and RAPIDS cuDF interoperate zero-copy. Leakage-safe pipelines are now default.

Risks and Considerations

Every preprocessing decision quietly keeps some data and discards the rest, and those choices carry consequences. Before you clean, it is worth asking whose data gets erased and who stays accountable for the result.

Risks & metrics

Rows of data being deleted during preprocessing, showing how cleaning choices erase minority groups and embed bias into a

ALAN opinion Start here 9 min Jun 6, 2026

Whose Data Gets Cleaned Away: Bias, Erasure, and Accountability in Preprocessing Decisions

Dropping rows with missing values erases minority groups, who carry more missing data. Preprocessing decisions encode bias before a model trains.