Suffix Array

Also known as: suffix index, sorted suffix index, SA

Suffix Array
A suffix array is a sorted list of every suffix of a text, stored as starting positions, that lets software locate repeated substrings quickly — the core data structure behind exact-substring deduplication of large training corpora.

A suffix array is a sorted array of all the suffixes of a text, stored as their starting positions, that lets a program find every repeated substring in a large dataset quickly and with little extra memory.

What It Is

When you train a language model on text scraped from the web, the same passages show up again and again — boilerplate, reposted articles, license headers, mirrored documentation. A model that sees the same paragraph many times tends to memorize it word for word and waste training steps on redundant text. Finding those exact repeats across a corpus with billions of words is the problem a suffix array solves. It is the index that makes “show me every chunk of text that appears more than once” a fast question instead of an impossible one.

A suffix is just the tail of a text starting at some position: for the word “data,” the suffixes are “data,” “ata,” “ta,” and “a.” A suffix array lists the starting position of every suffix in the whole document, then sorts those positions by the text that follows them, alphabetically. Think of it like the index at the back of a textbook, except instead of indexing chosen keywords, it indexes every possible starting point in the text and keeps them in sorted order.

Sorting matters because it puts identical stretches of text right next to each other. Once the suffixes are sorted, any substring that appears twice shows up as two neighboring entries that begin with the same characters. Comparing each entry to the one above it reveals every duplicated span without ever comparing every passage to every other passage. The array stores positions, not copies of the text, so it stays compact even for enormous inputs.

In deduplication work, the suffix array is built over the entire concatenated corpus — every document joined into one long string. According to arXiv, the exact-substring method removes duplicated spans of at least roughly fifty tokens found anywhere in the dataset, which is long enough to catch real repeated passages but short enough to leave normal language alone.

How It’s Used in Practice

Most people meet suffix arrays not by writing one but by running a dedup tool that uses one under the hood. According to the text-dedup Docs, the open-source text-dedup library exposes a SuffixArray script that performs exact-substring deduplication, sitting alongside its MinHash and other fuzzy-matching methods. You point it at your cleaned corpus, it builds the suffix array, finds the repeated spans, and writes out a dataset with the duplicate stretches removed. The reference implementation it descends from, according to GitHub, is Google Research’s deduplicate-text-datasets, where the matching is written in Rust for speed and wrapped in Python scripts.

This is the “exact” half of a typical dedup stack. Teams usually run fuzzy near-duplicate detection (MinHash, comparing whole documents) and exact-substring detection (suffix array, comparing byte or token spans) as complementary passes — one catches reworded copies, the other catches verbatim reuse.

Pro Tip: Run exact-substring dedup as a distinct pass, not a replacement for document-level fuzzy dedup. The suffix array is excellent at catching verbatim repeats but says nothing about paraphrased duplicates; reach for MinHash or semantic methods when the copies have been lightly edited.

When to Use / When Not

ScenarioUseAvoid
Removing verbatim repeated passages from a training corpus
Catching paraphrased or reworded duplicate documents
Reducing a model’s tendency to memorize boilerplate text
Comparing two short strings where a plain scan is simpler
Finding the longest repeated substring across a huge dataset
Grouping documents by overall topic similarity

Common Misconception

Myth: A suffix array compares whole documents to decide which ones are duplicates. Reality: It works at the substring level, not the document level. It finds repeated spans of text wherever they occur — even a shared paragraph buried inside two otherwise different documents. Document-level near-duplicate detection is a separate job, usually handled by methods like MinHash, and the two are run together for full coverage.

One Sentence to Remember

A suffix array is the sorted index that makes “find every exact repeat in this corpus” a fast operation, which is why it sits at the core of exact-substring deduplication; pair it with a fuzzy method to also catch the reworded copies.

FAQ

Q: What is a suffix array used for in AI training? A: It powers exact-substring deduplication — finding and removing text spans that appear verbatim more than once in a training corpus, which cuts memorization and wasted training on redundant data.

Q: What is the difference between a suffix array and MinHash for deduplication? A: A suffix array finds exact repeated substrings between documents. MinHash estimates how similar whole documents are, catching reworded near-duplicates. Most pipelines run both as complementary passes.

Q: Why sort the suffixes at all? A: Sorting places identical and overlapping text spans next to each other, so duplicates surface by comparing neighbors in the list instead of checking every passage against every other passage.

Sources

Expert Takes

Not magic. Sorting. A suffix array gives every starting position in the text an address and orders those addresses by what follows them. Identical spans become neighbors, so repeated text reveals itself through local comparison rather than exhaustive search. The data structure stores positions, not text copies, which is why it scales to corpora that would otherwise be far too large to compare span against span.

Treat exact-substring dedup as a named stage in your data spec, separate from fuzzy dedup. The failure mode is assuming one pass covers both verbatim and reworded copies — it does not. Wire the suffix-array step to run over the concatenated corpus, document the minimum match length you chose, and keep its output auditable. Specifying the two passes distinctly is what makes the cleaning reproducible.

Clean training data stopped being a nice-to-have and became table stakes. Teams that dedup aggressively get more model quality per training dollar, because they stop paying to memorize the same boilerplate. The suffix array is unglamorous plumbing, but plumbing that quietly lowers compute cost and lifts evaluation scores is exactly the kind of advantage that compounds across every model a team ships.

A deduplication choice is also an editorial choice. Decide what counts as a duplicate worth deleting, and you decide which voices get thinned and which survive into the model’s view of the world. The suffix array is neutral machinery, but the threshold someone sets on top of it is not. Who reviews that threshold, and who is accountable when over-aggressive cleaning erases something that mattered?