Unigram Tokenization

Also known as: Unigram model, Unigram LM, Unigram segmentation

Unigram tokenization is a probabilistic subword segmentation method that starts with a large candidate vocabulary, iteratively prunes the tokens whose removal increases the training loss least, and selects the highest-probability segmentation for each word during inference.

What It Is

When an LLM reads your text, it doesn’t see words — it sees tokens. How those tokens get carved out of raw text determines how well the model handles language. Unigram tokenization is one of three major subword approaches to this splitting problem, alongside BPE and WordPiece, and understanding how it differs from the others is central to grasping tokenizer architecture as a whole.

Most tokenization methods build upward. BPE, for instance, starts with individual characters and repeatedly merges the most frequent pairs until the vocabulary reaches a target size. Unigram does the opposite. Think of it like sculpting: instead of assembling a statue piece by piece, you start with a block of marble and chisel away what you don’t need. The algorithm begins with a large set of candidate tokens — often hundreds of thousands — and removes the ones that contribute least to the language model’s ability to represent text.

According to Kudo 2018, the Unigram model was introduced as part of “Subword Regularization,” a technique for improving neural machine translation. The core idea: given a sentence, there isn’t just one correct way to split it into subwords. The word “tokenization” could be split as [“token”, “ization”], [“to”, “ken”, “ization”], or several other combinations. Unigram assigns a probability to each possible segmentation and picks the one with the highest likelihood during inference.
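To make the idea concrete, here is a minimal sketch that enumerates every segmentation of “tokenization” permitted by a small vocabulary and scores each by its unigram probability. The vocabulary and probability values are made up for illustration, not taken from any trained model:

```python
import math

# Hypothetical unigram probabilities, for illustration only.
vocab = {
    "token": 0.05, "ization": 0.02, "to": 0.08,
    "ken": 0.01, "iza": 0.005, "tion": 0.04,
}

def segmentations(word):
    """Yield every way to split `word` into tokens from `vocab`."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        if prefix in vocab:
            for rest in segmentations(word[i:]):
                yield [prefix] + rest

def log_prob(tokens):
    # Unigram model: tokens are assumed independent, so log-probs add.
    return sum(math.log(vocab[t]) for t in tokens)

# Pick the highest-probability segmentation, as Unigram does at inference.
best = max(segmentations("tokenization"), key=log_prob)
```

With these toy numbers, ["token", "ization"] wins because two common pieces beat several rarer ones, which is exactly the behavior the probability model is designed to encourage.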

This probabilistic nature is what separates Unigram from BPE in tokenizer architecture. BPE always produces the same segmentation for a given input because its merge rules are fixed and deterministic. According to HuggingFace Docs, Unigram can sample different segmentations during training, which acts as a form of regularization — the model sees slightly different token sequences for the same text, making it more resilient to unusual inputs.
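The training-time sampling can be sketched the same way: instead of always taking the argmax, draw a segmentation with probability proportional to a smoothed version of its score. Everything below (vocabulary, probabilities, the `alpha` smoothing exponent) is an illustrative assumption, not a real model:

```python
import math
import random

# Made-up unigram probabilities, as a trained model might hold.
vocab = {"token": 0.05, "ization": 0.02, "to": 0.08,
         "ken": 0.01, "iza": 0.005, "tion": 0.04}

def segmentations(word):
    """Yield every split of `word` into tokens from `vocab`."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in vocab:
            for rest in segmentations(word[i:]):
                yield [word[:i]] + rest

def sample_segmentation(word, alpha=0.2, rng=random):
    """Sample a split with probability proportional to P(seg)**alpha.

    A smoothing exponent alpha < 1 flattens the distribution so rarer
    splits are seen more often -- the subword-regularization effect.
    """
    segs = list(segmentations(word))
    weights = [math.exp(alpha * sum(math.log(vocab[t]) for t in s))
               for s in segs]
    return rng.choices(segs, weights=weights, k=1)[0]
```

In the real SentencePiece library, the equivalent knob is exposed on `encode` via `enable_sampling=True` together with `alpha` and `nbest_size` parameters.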

The pruning process works iteratively. At each step, the algorithm calculates how much the total loss would increase if each token were removed. According to HuggingFace Docs, the bottom 10-20% of tokens — those whose removal increases loss the least — get cut. This cycle repeats until the vocabulary shrinks to the desired size.
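One round of that pruning loop can be sketched on a toy corpus. The corpus, candidate vocabulary, and probabilities below are invented for illustration, and the sketch skips the EM re-estimation step that the real algorithm runs between pruning rounds:

```python
import math

# Toy corpus and an over-large candidate vocabulary (numbers illustrative).
corpus = ["low", "lower", "lowest"]
vocab = {"l": 0.04, "o": 0.04, "w": 0.04, "e": 0.03, "r": 0.03,
         "s": 0.02, "t": 0.02, "low": 0.06, "er": 0.05, "est": 0.03,
         "we": 0.01}

def best_logprob(word, vocab):
    """Dynamic program: log-prob of the best segmentation of `word`."""
    best = [0.0] + [-math.inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in vocab:
                best[end] = max(best[end], best[start] + math.log(vocab[piece]))
    return best[-1]

def corpus_loss(vocab):
    # Negative log-likelihood of the corpus under the best segmentations.
    return -sum(best_logprob(w, vocab) for w in corpus)

base = corpus_loss(vocab)
# Single characters are kept: they guarantee every word stays segmentable.
candidates = [t for t in vocab if len(t) > 1]
increase = {t: corpus_loss({k: v for k, v in vocab.items() if k != t}) - base
            for t in candidates}
# Cut the ~20% of candidates whose removal raises the loss least.
n_prune = max(1, len(candidates) // 5)
pruned = sorted(increase, key=increase.get)[:n_prune]
```

Here the token "we" never appears in any best segmentation, so dropping it costs nothing and it is pruned first — the same signal the full algorithm uses at scale.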

How It’s Used in Practice

You encounter Unigram tokenization most often through models built on Google’s SentencePiece library. According to Google’s GitHub, SentencePiece supports both BPE and Unigram as training algorithms, and it powers the tokenizers behind T5, mT5, BigBird, and Pegasus. When you send text to any of these models — whether through an API, a Hugging Face pipeline, or a research notebook — the Unigram tokenizer is splitting your input behind the scenes.

In the broader context of tokenizer architecture, Unigram occupies the probabilistic slot in a three-way comparison with BPE and WordPiece. If you’re evaluating which tokenization strategy fits a multilingual model or a domain where unusual word forms are common (medical terminology, code-switching between languages), Unigram’s ability to consider multiple segmentation paths gives it an advantage over BPE’s fixed merge rules.

Pro Tip: If you’re fine-tuning or training a model on text that mixes scripts and languages, pair Unigram with SentencePiece. It handles unseen character combinations more gracefully than BPE because it evaluates segmentation probability across the full vocabulary rather than depending on pre-learned merge sequences.
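As a starting point, training a Unigram model with SentencePiece’s command-line tool looks roughly like the following. The flags come from SentencePiece’s documented CLI; `corpus.txt` and `unigram_demo` are placeholder names for your own data and output prefix:

```shell
# Train a Unigram tokenizer with SentencePiece.
spm_train --input=corpus.txt \
          --model_prefix=unigram_demo \
          --vocab_size=8000 \
          --model_type=unigram \
          --character_coverage=0.9995
```

For multilingual corpora, lowering `--character_coverage` slightly (as above) is the library’s usual way to keep extremely rare characters from bloating the vocabulary.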

When to Use / When Not

Use Unigram when:
- Training a multilingual model across diverse scripts
- Using subword sampling as training-time regularization
- Processing domain text with rare word forms (biomedical, legal)

Avoid Unigram when:
- Extending an existing BPE-based model’s tokenizer
- Needing exact compatibility with GPT or Claude tokenizers

Common Misconception

Myth: Unigram tokenization produces random, unpredictable token splits because it’s “probabilistic.”

Reality: During inference, Unigram uses Viterbi decoding to find the single highest-probability segmentation — producing consistent, deterministic output. The probabilistic element only applies during training, where sampling alternative segmentations acts as data augmentation to improve model quality.

One Sentence to Remember

Unigram tokenization sculpts a vocabulary by removing what doesn’t matter rather than building up what does, and its probabilistic training makes models more resilient to messy, real-world text — a fundamentally different philosophy from BPE that matters when choosing a tokenizer architecture.

FAQ

Q: How does Unigram tokenization differ from BPE? A: BPE builds vocabulary bottom-up by merging frequent character pairs. Unigram works top-down, starting with a large vocabulary and pruning low-value tokens based on how much each removal increases overall loss.

Q: Which major models use Unigram tokenization? A: Google’s T5, mT5, BigBird, and Pegasus all use Unigram through the SentencePiece library. GPT-family models and Claude use BPE-based tokenizers instead.

Q: Can Unigram tokenization handle multiple languages? A: Yes. Its probabilistic segmentation handles diverse scripts and rare word forms better than fixed merge rules, which is why it’s the default choice in multilingual models like mT5.

Expert Takes

The mathematical appeal of Unigram lies in its loss-based pruning. Where BPE makes greedy local decisions about which pairs to merge, Unigram evaluates each candidate token’s global contribution to the language model’s likelihood. This makes its vocabulary construction more principled from an information-theoretic perspective. The subword regularization aspect — sampling alternative segmentations during training — provides an implicit data augmentation effect that reduces overfitting to surface-level token patterns.

If you’re building a tokenizer pipeline and need to decide between BPE and Unigram, ask one question: does your training data contain consistent patterns or chaotic variation? BPE rewards frequency. Unigram rewards probability. For English-only production systems with stable inputs, BPE does the job. For anything multilingual or domain-mixed, Unigram’s probabilistic vocabulary handles edge cases that BPE’s rigid merges will miss.

Most teams never choose their tokenizer — they inherit whatever the foundation model shipped with. But for organizations training custom models on specialized data, the BPE-versus-Unigram decision directly affects output quality. Multilingual products, medical documentation systems, and cross-border applications all benefit from Unigram’s flexibility. The teams that understand their tokenizer architecture ship better products than those who treat it as a black box.

The choice of tokenization algorithm quietly shapes who gets served well and who doesn’t. BPE’s frequency bias means languages with smaller training corpora get worse segmentation — longer token sequences, higher costs, lower quality outputs. Unigram’s probabilistic approach can partially address this imbalance, but only if the training data includes those languages in the first place. The algorithm alone doesn’t fix representation problems baked into the dataset.