Text Dedup
Also known as: text-dedup, text deduplication library, training data dedup tool
Text Dedup is an open-source Python library that removes duplicate and near-duplicate text from large datasets, packaging several deduplication algorithms behind a single command-line tool to clean the corpora that train language models.
What It Is
Anyone who has assembled a large text dataset runs into the same problem: the same content shows up over and over. A news article gets republished across fifty sites, a license agreement appears in thousands of code repositories, a Wikipedia paragraph gets scraped a dozen times. When that data trains a language model, the repetition teaches the model to memorize instead of generalize, wastes compute on redundant examples, and can leak verbatim text at generation time. Text Dedup exists to find and strip those duplicates before training begins.
Think of it like the “find duplicate files” feature on your computer, except instead of comparing whole files byte-for-byte, it can also catch documents that are almost identical — the same article with a different headline, or a paragraph reused with two words swapped. That “almost identical” detection is the hard part, and it’s why a dedicated tool exists.
Under the hood, Text Dedup bundles four different methods, each suited to a different kind of duplicate. MinHash combined with Locality-Sensitive Hashing (LSH) finds near-duplicates, meaning documents that share most but not all of their content, by estimating how much two texts overlap without comparing every pair directly. SimHash does similar near-duplicate detection using compact fingerprints. A suffix array finds exact repeated substrings, useful for catching long passages copied between documents. A bloom filter catches exact whole-document duplicates with very little memory, at the cost of a small false-positive rate. According to the text-dedup repository, the library also ships a Spark implementation of MinHash so the same job can run across datasets measured in terabytes.
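To make the MinHash idea concrete, here is a minimal sketch in plain Python. It is not Text Dedup's own code: the function names, the 3-word shingle size, and the 128 hash functions are all assumptions chosen for illustration, and the LSH step that avoids comparing every pair is left out. It only shows how a short signature can approximate the overlap between two documents.

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_perm=128):
    """Keep the smallest hash value under each of num_perm hash functions."""
    return [min(seeded_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature slots that agree, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, computed only for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


# Illustrative documents (made up for this sketch).
doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
doc_c = "stock markets closed higher after the central bank held interest rates steady"

sh_a, sh_b, sh_c = shingles(doc_a), shingles(doc_b), shingles(doc_c)
sig_a, sig_b, sig_c = (minhash_signature(s) for s in (sh_a, sh_b, sh_c))

print("a vs b exact:", round(exact_jaccard(sh_a, sh_b), 3),
      "estimated:", round(estimated_jaccard(sig_a, sig_b), 3))
print("a vs c exact:", round(exact_jaccard(sh_a, sh_c), 3),
      "estimated:", round(estimated_jaccard(sig_a, sig_c), 3))
```

The signature has a fixed length no matter how long the document is, which is why comparing signatures stays cheap. A production setup would also group signatures into LSH buckets so that only likely matches are compared at all.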
You point the tool at a dataset, choose a method, set a similarity threshold, and it outputs a cleaned version. Configuration happens through TOML files rather than long command strings, which keeps a deduplication run reproducible and easy to share with a teammate.
How It’s Used in Practice
The mainstream use is a preprocessing step in a machine learning data pipeline. A team preparing a training corpus — often a Hugging Face dataset — runs Text Dedup as one stage between collecting raw text and tokenizing it for the model. They typically start with MinHash + LSH to remove near-duplicate documents, since that catches the most common form of bloat: the same content syndicated, mirrored, or lightly edited across many sources.
The similarity threshold is the dial that matters. With MinHash it acts as a minimum similarity score: two documents count as duplicates only when their estimated overlap reaches it. Set it too high and lightly edited duplicates slip through. Set it too low and the tool starts flagging documents that merely share a common template or boilerplate footer as “duplicates,” deleting genuinely distinct content and quietly narrowing the diversity of the dataset. That trade-off, false positives versus missed duplicates, is the central tension in any deduplication effort, and it’s why the threshold gets tuned against samples rather than guessed.
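The trade-off can be seen on a toy example. The sketch below assumes the threshold is a minimum word-level Jaccard similarity, and the documents, footer text, and function names are invented for illustration. It does not call Text Dedup. A syndicated copy and an unrelated page that shares only a footer are compared at three cutoffs.

```python
from itertools import combinations


def jaccard(text_a, text_b):
    """Word-level Jaccard similarity between two texts."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def flagged_pairs(docs, threshold):
    """Return the document-name pairs whose similarity reaches the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if jaccard(docs[x], docs[y]) >= threshold
    ]


# Hypothetical boilerplate footer shared by every page on a site.
FOOTER = ("all rights reserved contact us privacy policy terms of service "
          "subscribe to our newsletter")

docs = {
    "article": "city council approves new bike lanes downtown " + FOOTER,
    "mirror": "city council approves new bike lanes downtown today " + FOOTER,
    "recipe": "simmer tomatoes with garlic and basil for twenty minutes " + FOOTER,
}

for threshold in (0.99, 0.8, 0.4):
    print(threshold, flagged_pairs(docs, threshold))
```

At the strictest cutoff the mirrored article escapes, at the middle cutoff only the true near-duplicate is caught, and at the lowest cutoff the recipe is flagged against both news pages purely because of the shared footer.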
Pro Tip: Before running a full pass, deduplicate a small sample and read the pairs the tool flagged as matches. If it’s grouping documents that only share a navigation menu or a license header, the run is too aggressive: require a higher similarity before two documents count as a match, or strip the boilerplate first. Eyeballing fifty flagged pairs saves you from silently deleting half your long-tail content.
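One way to "strip the boilerplate first" is to drop lines that recur across a large share of the sample before comparing documents. The helper below is a hypothetical sketch, not a Text Dedup feature, and the 50% cutoff and sample documents are assumptions for illustration.

```python
from collections import Counter


def strip_common_lines(docs, max_share=0.5):
    """Drop lines that appear in more than max_share of the documents."""
    doc_count = Counter()
    for text in docs:
        # Count each distinct non-empty line once per document.
        doc_count.update({line.strip() for line in text.splitlines() if line.strip()})
    limit = max_share * len(docs)
    cleaned = []
    for text in docs:
        kept = [
            line.strip()
            for line in text.splitlines()
            if line.strip() and doc_count[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def word_jaccard(text_a, text_b):
    """Word-level Jaccard similarity, used to compare before and after."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up sample: three unrelated pages with the same header lines.
HEADER = "Licensed under the MIT License\nHome | About | Contact\n"
sample = [
    HEADER + "City council approves new bike lanes",
    HEADER + "Simmer tomatoes with garlic and basil",
    HEADER + "Quarterly earnings beat analyst forecasts",
]

cleaned = strip_common_lines(sample)
print("before:", round(word_jaccard(sample[0], sample[1]), 2))
print("after: ", round(word_jaccard(cleaned[0], cleaned[1]), 2))
```

Line-frequency filtering is crude: it also removes lines that are legitimately common, so the cutoff deserves the same sample-and-inspect treatment as the similarity threshold.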
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Cleaning a large LLM training corpus before tokenization | ✅ | |
| Removing syndicated or mirrored near-duplicate web pages | ✅ | |
| Deduplicating a handful of files you can eyeball manually | | ❌ |
| Terabyte-scale datasets needing distributed processing | ✅ | |
| Finding documents with the same meaning but different words | | ❌ |
| Deduplicating structured database rows (use a database instead) | | ❌ |
Common Misconception
Myth: Deduplication is a safe, lossless cleanup step — more aggressive deduplication always produces a better dataset.
Reality: Every deduplication pass makes a judgment call about what counts as “the same,” and that call has costs. Too aggressive a threshold treats distinct documents that share boilerplate or a common format as duplicates and deletes them, shrinking the diversity the model learns from. Text Dedup gives you the methods and the threshold control, but it can’t decide for you where useful repetition ends and harmful redundancy begins.
One Sentence to Remember
Text Dedup is the practical toolbox for removing duplicate text from training data — the value isn’t in running it but in tuning its threshold so you cut redundancy without cutting diversity.
FAQ
Q: What’s the difference between Text Dedup and just removing exact duplicate lines?
A: Removing exact duplicates only catches byte-for-byte copies. Text Dedup also finds near-duplicates — documents that are mostly the same with small edits — using methods like MinHash, which a simple line-matching script cannot detect.
Q: Which deduplication method should I start with?
A: Most teams start with MinHash plus LSH for near-duplicate documents, the most common form of dataset bloat. Use a suffix array for exact repeated passages, or a bloom filter for memory-efficient exact whole-document matching.
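For a sense of why a bloom filter is memory-efficient, here is a minimal sketch of exact whole-document deduplication. The class, its default sizes, and the function name are assumptions for illustration and do not reflect how Text Dedup implements it. The filter stores only bits, never the documents themselves.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash functions per item."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Yield one bit position per hash function for this item."""
        data = item.encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def drop_exact_duplicates(docs):
    """Keep the first occurrence of each document, in order."""
    seen = BloomFilter()
    unique = []
    for doc in docs:
        if doc in seen:
            continue  # probably seen before; false positives are possible but rare
        seen.add(doc)
        unique.append(doc)
    return unique


print(drop_exact_duplicates(["alpha", "beta", "alpha", "gamma", "beta"]))
```

A bloom filter never misses a true repeat, but it can occasionally report an unseen document as seen. The bit-array size and number of hashes control how rare that is.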
Q: Can Text Dedup find documents that mean the same thing but use different words?
A: No. Text Dedup compares text surface features, not meaning, so a reworded passage may slip through. Catching same-meaning duplicates requires semantic deduplication, which compares embedding vectors rather than text overlap.
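A small illustration of that limit, using word overlap as a stand-in for surface-level comparison. The sentences and the function name are made up for this sketch.

```python
def surface_overlap(text_a, text_b):
    """Share of distinct words the two texts have in common (Jaccard)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 1.0


original = "the company reported record profits this quarter"
light_edit = "the company reported record profits this quarter again"
paraphrase = "the firm announced its highest ever earnings for the period"

print("light edit:", round(surface_overlap(original, light_edit), 3))
print("paraphrase:", round(surface_overlap(original, paraphrase), 3))
```

The light edit scores high and would be caught as a near-duplicate, while the paraphrase shares almost no words with the original even though it says the same thing.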
Sources
- text-dedup: ChenghaoMou/text-dedup repository - Source repository documenting the available methods and Spark implementation.
- text-dedup PyPI: text-dedup PyPI package page - Package page listing the current release and license.
Expert Takes
Deduplication is not deletion. It is a similarity judgment encoded as a threshold. MinHash estimates document overlap without comparing every pair, which is what makes web-scale cleaning feasible at all. The interesting question is not whether two documents match, but how much overlap you decide counts as the same — that boundary is a modeling choice, not a fact the data hands you.
Treat deduplication as a configured, reproducible stage, not an ad-hoc script. The TOML-driven setup means a run is a file you can version, review, and rerun against the same threshold. The failure most teams hit is silent: an overly aggressive threshold deletes distinct content that merely shares boilerplate. Build a sample-and-inspect check into the pipeline before you trust a full pass.
Clean training data is becoming a real competitive line, and tooling like this turns corpus hygiene from a research chore into a repeatable step. The teams that win aren’t the ones with the most data — they’re the ones who know what to throw away. Deduplication is where that discipline starts, and open-source tooling means it’s table stakes now, not an edge.
Every threshold decides what disappears from a model’s view of the world. Set it carelessly and you strip out minority phrasings, rare dialects, and long-tail documents that happen to share a template with something common. Who audits that loss? The tool reports how many documents it removed, but not whose voice was in them. Efficiency and erasure can look identical in a log file.