MinHash
Also known as: min-wise hashing, MinHash signatures, minwise independent permutations
MinHash is a technique that estimates how similar two sets are by comparing short numeric signatures instead of the full sets, making it practical to find near-duplicate documents across huge collections.
What It Is
When teams build the datasets that train AI models, they pull in billions of documents from the web. Many of those documents are near-duplicates: the same article republished on ten sites, boilerplate footers, or text that differs by a few words. Comparing every document against every other one means the number of comparisons grows with the square of the corpus size, which is far too slow at that scale. MinHash exists to make that comparison fast and affordable.
The core idea is to shrink each document down to a small “fingerprint” called a signature, then compare fingerprints instead of full documents. To build a signature, MinHash first breaks a document into a set of small overlapping pieces, usually short word sequences called shingles. Think of shingles as a sliding window: with three-word shingles, “the quick brown fox” becomes “the quick brown” and “quick brown fox.” Two documents that share most of their wording will share most of their shingles.
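The sliding-window idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `word_shingles`, the default of three words per shingle, and the lowercase-and-split tokenization are all assumptions made for the example.

```python
def word_shingles(text: str, k: int = 3) -> set[str]:
    """Return the set of k-word shingles (overlapping word windows) in text."""
    # Assumption: lowercase and split on whitespace; real pipelines often
    # normalize punctuation and Unicode as well.
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a text shorter than k words becomes a single shingle.
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


doc_a = word_shingles("the quick brown fox jumps over the lazy dog")
doc_b = word_shingles("the quick brown fox leaps over the lazy dog")

# Shingles present in both documents.
shared = doc_a & doc_b
print(sorted(shared))
```

Changing a single word removes every shingle that window touches, so the two example sentences share four of their seven shingles each.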
MinHash then applies a series of hash functions to that set of shingles. A hash function is just a rule that turns each piece of text into a number. For each hash function, MinHash keeps only the single smallest number it produced across all the shingles; that is the “min” in MinHash. Repeat this with many hash functions and you get a row of numbers: the signature. The clever part is the math behind it. For a well-behaved hash function, the chance that two documents produce the same minimum value equals the Jaccard similarity of their shingle sets: the number of shingles they share divided by the number of distinct shingles across both. So the fraction of signature positions that match is an estimate of that similarity, and it gets more precise as you add hash functions, all without ever comparing the documents directly.
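As a rough sketch of the signature step, the code below builds signatures and compares them. It assumes a salted `hashlib.blake2b` digest as a stand-in for a family of independent hash functions; the function names and the default of 128 hash functions are hypothetical choices for illustration, not values from any particular toolkit.

```python
import hashlib


def seeded_hash(item: str, seed: int) -> int:
    """Hash item to a 64-bit integer; each seed acts as a different hash function."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingles: set[str], num_hashes: int = 128) -> list[int]:
    """Keep the smallest hash value per hash function: one slot per function."""
    if not shingles:
        raise ValueError("cannot build a signature for an empty shingle set")
    return [min(seeded_hash(s, seed) for s in shingles) for seed in range(num_hashes)]


def estimate_similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching slots, which estimates the Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity, used here only to check the estimate."""
    return len(a & b) / len(a | b)


# Two synthetic "shingle sets" with a known overlap of 50 out of 150 items.
set_a = {str(i) for i in range(100)}
set_b = {str(i) for i in range(50, 150)}
print("exact:   ", jaccard(set_a, set_b))
print("estimate:", estimate_similarity(minhash_signature(set_a), minhash_signature(set_b)))
```

The estimate will not match the exact value digit for digit; it is a sample-based approximation whose spread shrinks as `num_hashes` grows.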
Because signatures are tiny and fixed in size, they turn an unmanageable problem into a tractable one. A document of any length collapses to the same compact fingerprint, and comparing two fingerprints is a matter of counting matching slots.
How It’s Used in Practice
The most common place MinHash shows up is deduplication of training data for large language models. Before a model is trained, engineers run the raw corpus through a deduplication pipeline that uses MinHash to flag documents that are near-identical to ones already kept. Removing those duplicates matters because repeated text pushes a model to memorize passages instead of learning general patterns, and it wastes expensive compute on redundant examples.
In a typical pipeline, MinHash is paired with a partner technique called Locality-Sensitive Hashing (LSH), which groups similar signatures into the same buckets so the system only compares documents that are likely to match. This combination is what makes web-scale deduplication practical: without it, every signature would still have to be compared against every other one. Data curation toolkits commonly include this MinHash-plus-LSH flow as a standard step.
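One common way to do the bucketing is the banding scheme, sketched below under that assumption; the text above does not say which LSH variant a given toolkit uses. Each signature is cut into `bands` slices of `rows` values, and two documents become a candidate pair if any slice matches exactly. The function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(
    signatures: dict[str, list[int]], bands: int, rows: int
) -> set[tuple[str, str]]:
    """Return pairs of document ids that share at least one identical band."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for doc_id, sig in signatures.items():
        if len(sig) != bands * rows:
            raise ValueError("signature length must equal bands * rows")
        for band in range(bands):
            # The band index is part of the key so bands never mix.
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)

    pairs: set[tuple[str, str]] = set()
    for doc_ids in buckets.values():
        for a, b in combinations(sorted(doc_ids), 2):
            pairs.add((a, b))
    return pairs


def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Chance that a pair with the given Jaccard similarity shares a band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Toy signatures written by hand to show the mechanics.
toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
print(lsh_candidate_pairs(toy, bands=2, rows=2))
```

Only the candidate pairs are then checked more carefully, for example by comparing full signatures. More rows per band makes a match harder, and more bands gives each pair more chances to match, which is how the two numbers together set the effective similarity cutoff.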
Pro Tip: The two knobs that matter most are shingle size and similarity threshold. Smaller shingles catch looser paraphrases but flag more false positives; a higher threshold keeps only tight duplicates. Run a small sample first and eyeball the pairs it flags before running it on the full corpus — the right settings depend on your data, not on a universal default.
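To try settings on a small sample, something like the following sketch may help. It computes exact Jaccard scores rather than MinHash estimates, which is affordable for a handful of documents, and lists the pairs that clear a threshold for a given shingle size. The helper names, the sample texts, and the 0.7 threshold are illustrative assumptions.

```python
from itertools import combinations


def shingles(text: str, k: int) -> set[str]:
    """Set of k-word shingles; texts shorter than k words become one shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def flagged_pairs(
    docs: dict[str, str], k: int, threshold: float
) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair scoring at or above threshold."""
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    flagged = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score >= threshold:
            flagged.append((a, b, score))
    return flagged


sample = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox leaps over the lazy dog",
    "d3": "an entirely different sentence about training data",
}

for k in (1, 3):
    print(k, flagged_pairs(sample, k=k, threshold=0.7))
```

With single-word shingles the two fox sentences clear the 0.7 threshold; with three-word shingles the one changed word breaks enough windows that they no longer do.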
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Finding near-duplicate documents across a web-scale corpus | ✅ | |
| Detecting whether two texts mean the same thing in different words | | ❌ |
| Deduplicating training data before a model run | ✅ | |
| Comparing only a handful of short documents | | ❌ |
| Estimating overlap when exact comparison is too slow | ✅ | |
| Catching duplicates that share meaning but no wording | | ❌ |
Common Misconception
Myth: MinHash understands what a document means and finds duplicates based on meaning. Reality: MinHash only measures overlap in actual wording — shared sequences of characters or words. Two passages that express the same idea with completely different words look unrelated to it. Catching meaning-based duplicates is the job of semantic deduplication, which compares embeddings instead of shingles.
One Sentence to Remember
MinHash trades a tiny bit of accuracy for an enormous gain in speed, letting you estimate how much two documents overlap by comparing small fingerprints — the practical foundation for cleaning duplicates out of massive training datasets.
FAQ
Q: What is MinHash used for? A: It estimates how similar two sets are from compact signatures, most often to find and remove near-duplicate documents in large datasets before training an AI model.
Q: What is the difference between MinHash and LSH? A: MinHash builds the similarity fingerprints; LSH groups similar fingerprints into buckets so you only compare likely matches. They are usually used together for fast deduplication.
Q: Does MinHash find documents with similar meaning? A: No. It only detects overlap in actual wording. Catching duplicates that share meaning but use different words requires semantic deduplication based on embeddings.
Expert Takes
MinHash works because of a precise probabilistic guarantee, not a heuristic. Under an idealized random hash function, the probability that two sets share a minimum hash value equals their Jaccard similarity exactly. That single property is what lets a row of numbers stand in for a full document comparison. Understanding the guarantee, rather than just running the tool, is what tells you when the estimate can be trusted.
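Written out, the guarantee and its consequence for accuracy look roughly like this. The notation is introduced here for illustration: $A$ and $B$ are the two shingle sets, $h_1, \dots, h_k$ are the hash functions, and the derivation assumes they are independent and each behaves like a random permutation of the shingles.

```latex
\[
\Pr\!\left[\min_{x \in A} h(x) = \min_{x \in B} h(x)\right]
  = \frac{|A \cap B|}{|A \cup B|} = J(A, B)
\]

\[
\hat{J} = \frac{1}{k} \sum_{i=1}^{k}
  \mathbf{1}\!\left[\min_{x \in A} h_i(x) = \min_{x \in B} h_i(x)\right],
\qquad
\mathbb{E}[\hat{J}] = J,
\qquad
\operatorname{Var}(\hat{J}) = \frac{J(1 - J)}{k} \le \frac{1}{4k}
\]
```

Under those assumptions the standard error is at most $1/(2\sqrt{k})$, so a signature of 128 values gives an estimate with a standard error of about 0.044 or less.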
Treat MinHash as a configured stage in a deduplication spec, not a black box you drop in. The shingle size, number of hash functions, and similarity threshold are parameters you declare and version alongside the rest of your data pipeline. Write them down, test them on a sample, and the same corpus will deduplicate the same way every time you run it.
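One way to make those parameters explicit is a small, immutable config object that can be serialized next to the pipeline code. This is only a sketch: the class name `MinHashConfig`, its fields, and the default values are hypothetical placeholders, not recommendations or any toolkit's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MinHashConfig:
    """Declared deduplication parameters (illustrative defaults only)."""
    shingle_size: int = 5      # words per shingle
    num_hashes: int = 128      # signature length
    threshold: float = 0.8     # similarity at or above which a pair is a duplicate
    seed: int = 0              # fixes the hash functions so runs are repeatable

    def __post_init__(self) -> None:
        if self.shingle_size < 1:
            raise ValueError("shingle_size must be at least 1")
        if self.num_hashes < 1:
            raise ValueError("num_hashes must be at least 1")
        if not 0.0 < self.threshold <= 1.0:
            raise ValueError("threshold must be in (0, 1]")

    def to_json(self) -> str:
        """Stable serialization suitable for committing to version control."""
        return json.dumps(asdict(self), sort_keys=True)


config = MinHashConfig(shingle_size=3, num_hashes=64, threshold=0.7)
print(config.to_json())
```

Freezing the dataclass prevents a stage from silently changing a parameter mid-run, and the sorted JSON output produces a stable diff when the settings change.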
As training corpora balloon, the teams that can clean data cheaply move faster than the ones that cannot. MinHash is unglamorous plumbing, but it is the kind of plumbing that decides whether a data pipeline scales or stalls. Deduplication quality is quietly becoming a competitive line between serious model builders and everyone else.
Deduplication shapes what a model sees, and that gives it quiet power. Set the similarity threshold too low and you strip out legitimate variety; set it too high and near-duplicates slip through for the model to memorize. Nobody votes on these settings, yet they influence what the model learns and forgets. The honest question is who reviews these choices, and whether anyone outside the pipeline ever sees them.