Masked Language Modeling

Also known as: MLM, masked LM, cloze task

Masked Language Modeling
A self-supervised pre-training technique where random tokens in a sentence are hidden behind a mask and the model learns to predict them using surrounding context from both directions, enabling deep bidirectional language understanding.

Masked language modeling is a pre-training method where a model learns to predict hidden words in a sentence by reading surrounding context from both directions simultaneously.

What It Is

Before a language model can answer questions or generate code, it needs to learn what language looks like. Pre-training is where that learning happens, and masked language modeling (MLM) is one of the most widely adopted methods for teaching a model to understand text rather than just produce it. Instead of reading sentences left to right the way you read a book, an MLM-trained model looks at context from both sides — left and right — to fill in deliberately hidden words.

Think of it like a fill-in-the-blank test. Take a sentence: “The cat sat on the ____.” The model must figure out that “mat” or “floor” fits based on the surrounding words. According to Devlin et al., the original approach masks roughly 15% of tokens (the individual word-pieces the model processes) in each training sequence. The model then tries to predict each masked token by attending to every other visible token in the sentence. This bidirectional attention — reading both forward and backward at the same time — is what separates MLM from autoregressive methods used by GPT-family models, which can only look at tokens that came before the current position.
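The masking procedure described above can be sketched in a few lines. This is a toy illustration, not BERT's actual implementation: the function name `mask_tokens` is made up for this example, and where Devlin et al. replace a chosen token with a random token drawn from the full vocabulary, this sketch draws from the sequence itself for simplicity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, rng=None):
    """Toy sketch of BERT-style masking (Devlin et al.): pick ~15% of
    positions; of those, 80% become [MASK], 10% become a random token,
    and 10% are left unchanged. The original token at each picked
    position becomes the prediction target."""
    rng = rng or random.Random()
    masked = list(tokens)
    labels = {}  # position -> original token the model must predict
    n_pick = max(1, round(len(tokens) * mask_rate))
    for i in rng.sample(range(len(tokens)), n_pick):
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK
        elif roll < 0.9:
            # BERT samples from the full vocabulary; we sample from the
            # sequence here to keep the sketch self-contained.
            masked[i] = rng.choice(tokens)
        # else: token stays unchanged, but the model must still predict it

    return masked, labels

tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens, rng=random.Random(0))
print(masked, labels)
```

The 80/10/10 split matters: if every picked token became `[MASK]`, the model would never see the `[MASK]` symbol at fine-tuning time, creating a train/inference mismatch.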

According to Hugging Face Docs, notable models trained with MLM include BERT, RoBERTa, ALBERT, DeBERTa, and ModernBERT. While autoregressive (left-to-right) objectives now dominate the largest language models built for text generation, MLM remains the standard training approach for encoder-based models. These encoders power everyday tasks like search ranking, text classification, named entity recognition, and semantic similarity — areas where understanding the full context of a sentence matters more than generating new text.

How It’s Used in Practice

Where do you actually run into MLM-trained models? Most commonly in search and classification systems. When you type a query into a search engine or an enterprise knowledge base, an encoder model (likely trained with MLM) converts your query and candidate documents into numerical representations and finds the closest matches. Sentence-transformer libraries, widely used in retrieval-augmented generation (RAG) pipelines, rely on encoder backbones pre-trained with masked language modeling.

You also encounter MLM indirectly whenever a tool classifies your support ticket, flags a suspicious email, or extracts key entities from a contract. These tasks need a model that reads the entire input at once — exactly what MLM training produces.

Pro Tip: If you’re building a text classification or search feature, reach for a pre-trained encoder model (like DeBERTa or ModernBERT) rather than a large autoregressive model. Encoders trained with MLM are smaller, faster, and often more accurate for understanding-focused tasks than their generative counterparts.

When to Use / When Not

Use an MLM-trained encoder for:
- Text classification or sentiment analysis
- Semantic search or document retrieval
- Named entity recognition in documents

Avoid it for:
- Open-ended text generation (chatbots, creative writing)
- Code completion requiring left-to-right prediction
- Long-form summarization requiring fluent output

Common Misconception

Myth: Masked language modeling is outdated because GPT-style models don’t use it. Reality: MLM and autoregressive pre-training serve different purposes. Autoregressive models excel at generation. MLM-trained encoders excel at understanding — classification, retrieval, and similarity matching. Most production NLP systems still run encoder models under the hood, even when a generative model handles the user-facing conversation.

One Sentence to Remember

Masked language modeling teaches a model to understand language by hiding words and forcing it to read context from both directions to fill in the gaps, which is why MLM-trained encoders remain the backbone of search, classification, and retrieval systems even as generative models grab the spotlight.

FAQ

Q: What is the difference between masked language modeling and autoregressive modeling? A: Masked language modeling reads context from both directions to predict hidden tokens. Autoregressive modeling reads left to right and predicts the next token in sequence. MLM favors understanding; autoregressive favors generation.
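The contrast in this answer can be made concrete: for a token at position i, an autoregressive model may attend only to positions before i, while an MLM sees every position except the masked one. A minimal sketch (function names are illustrative):

```python
def ar_context(tokens, i):
    # Autoregressive: only tokens strictly before position i are visible.
    return tokens[:i]

def mlm_context(tokens, i):
    # MLM: every token except the masked position i is visible,
    # so context flows in from both directions.
    return tokens[:i] + tokens[i + 1:]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(ar_context(tokens, 2))   # ['the', 'cat']
print(mlm_context(tokens, 2))  # ['the', 'cat', 'on', 'the', 'mat']
```

Predicting "sat" from only "the cat" is much harder than predicting it when "on the mat" is also visible, which is why the bidirectional objective yields richer representations for understanding tasks.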

Q: Why do MLM models mask only 15% of tokens? A: According to Devlin et al., masking too many tokens removes too much context for reliable prediction, while masking too few slows training because the model gets less learning signal per sequence.

Q: Can I use a masked language model like BERT for chatbot responses? A: Not directly. MLM-trained models lack autoregressive decoding, so they can’t generate fluent text token by token. Use a generative model for chatbots and reserve MLM-trained encoders for retrieval or classification tasks behind the scenes.

Expert Takes

Masked language modeling works because it forces bidirectional representation learning. By hiding tokens and requiring prediction from surrounding context in both directions, the model builds embeddings that capture syntactic and semantic relationships simultaneously. This bidirectional constraint is precisely why encoder models outperform autoregressive alternatives on classification and retrieval benchmarks — the representation is richer because no context direction is blocked during training.

If you’re building a retrieval pipeline or classification service, your first question should be which encoder to fine-tune — not whether to use one. MLM-trained encoders slot into existing architectures with minimal overhead. Pair one with a vector database for semantic search, or fine-tune on labeled examples for classification. The practical advantage is speed and cost: encoders are smaller and cheaper to run at inference time than full generative models.

Every major search engine, enterprise tool, and recommendation system still runs encoder models pre-trained with masked language modeling. The industry spotlight moved to generative models, but the infrastructure underneath relies heavily on MLM-trained encoders for understanding tasks. Companies betting entirely on autoregressive architectures for classification and retrieval are overpaying for capability they don’t need. The smart move is matching the training objective to the job.

There’s a quiet assumption baked into masked language modeling: that predicting missing words from context is a good proxy for understanding language. But whose language, and whose context? Models trained this way absorb whatever patterns dominate their training data — including biases, cultural defaults, and blind spots. When these encoders power hiring tools, content moderation, or legal document analysis, the question isn’t just accuracy. It’s whose version of normal the model learned to reconstruct.