Next Token Prediction

Also known as: next-word prediction, causal language modeling, autoregressive prediction

A training method where a language model learns to predict the most probable next token (a word or subword) in a sequence, using only the tokens that came before it. This objective is the core of decoder-only transformer architectures like GPT and Claude.

What It Is

Every time you type a message into ChatGPT or Claude and get a response, the model is generating text one piece at a time — predicting what should come next based on everything it has read so far. That process is next token prediction, and it is the single training objective that powers virtually every modern large language model.

Put simply, a “token” is a chunk of text — sometimes a whole word, sometimes a word fragment, sometimes punctuation. During training, the model reads massive amounts of text and practices one task: given a sequence of tokens, guess what token comes next. According to Radford et al., the formal objective is to maximize the probability of each token given all the tokens before it, using a loss function called cross-entropy. The model gets better at this game by adjusting its internal parameters until its predictions closely match real text.
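The objective can be made concrete with a toy example. The sketch below assumes a made-up five-token vocabulary and hand-picked probability distributions (not from any real model) to show how cross-entropy rewards confident, correct predictions:

```python
import math

# Hypothetical toy vocabulary and model outputs for illustration only.
vocab = ["the", "cat", "sat", "mat", "."]

def cross_entropy(probs, target_index):
    """Cross-entropy loss for one prediction: -log p(correct next token)."""
    return -math.log(probs[target_index])

# Suppose the true next token after "the cat" is "sat" (index 2).
confident = [0.05, 0.05, 0.80, 0.05, 0.05]   # right and confident -> low loss
uncertain = [0.20, 0.20, 0.20, 0.20, 0.20]   # pure guessing -> higher loss

loss_good = cross_entropy(confident, 2)
loss_bad = cross_entropy(uncertain, 2)
```

Training drives the model's parameters toward distributions like `confident`: the lower the loss, the closer its predictions are to real text.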

What makes this work is a mechanism called causal masking. According to Vaswani et al., causal masking prevents the model from “peeking” at future tokens during training — it can only look backward. Think of it like reading a novel with every page after your current one glued shut. You can only guess the next sentence from what you’ve already read. This constraint is what makes next token prediction autoregressive: the model generates one token, appends it to the input, then predicts the next one, and repeats.
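The "glued pages" constraint is implemented as a lower-triangular attention mask. A minimal sketch for a five-token sequence, where row i marks which positions token i is allowed to attend to:

```python
import numpy as np

# Causal mask for a 5-token sequence: entry [i, j] is 1 if position i
# may attend to position j, which is allowed only when j <= i.
seq_len = 5
mask = np.tril(np.ones((seq_len, seq_len), dtype=int))
print(mask)
```

The first position can attend only to itself, while the last position sees the entire preceding sequence, so no token ever "peeks" forward during training.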

This single objective — predict the next token — turns out to be surprisingly powerful. By training on enough text with enough parameters, models learn grammar, facts, reasoning patterns, and even code structure, all from this one task. It is the reason decoder-only architectures (which are built around this objective) have become dominant over encoder-decoder designs, especially as models scale up. The simplicity of one unified objective means every parameter in the network contributes to the same goal, which improves data efficiency and scaling behavior.

How It’s Used in Practice

When you ask an AI assistant to draft an email, summarize a document, or answer a question, the model is running next token prediction in real time. It takes your prompt as the starting sequence, then generates a response token by token. Each new token feeds back into the model as context for predicting the one after it. This is why you sometimes see AI responses appear word by word in a streaming interface — that is the autoregressive loop in action.
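That feedback loop can be sketched in a few lines. Here a hypothetical hard-coded bigram table stands in for a real language model, purely to show the append-and-repeat structure:

```python
# Stand-in "model": maps the last token to its predicted successor.
# (Illustrative assumption; a real model predicts from the full context.)
bigram = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}

def generate(prompt_tokens, max_new_tokens):
    """Generate one token at a time, feeding each back in as context."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = bigram.get(tokens[-1])  # predict from current context
        if next_token is None:
            break                            # no prediction available: stop
        tokens.append(next_token)            # append, then predict again
    return tokens

print(generate(["the"], 3))  # ['the', 'cat', 'sat', 'on']
```

Streaming interfaces surface exactly this loop: each appended token is shown as soon as it is produced.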

The same mechanism powers code completion in tools like Cursor and GitHub Copilot. As you type code, the model predicts the most likely next tokens — function names, arguments, closing brackets — based on the code written so far. It is also why models can follow instructions: during training on instruction-formatted text, the model learned that certain prompt patterns predict certain response patterns.

Pro Tip: If a model gives you an answer that starts strong but drifts off topic halfway through, that is a practical consequence of next token prediction. Each token is conditioned on the growing context, and errors can compound. Breaking long requests into shorter, focused prompts helps keep the predictions on track.

When to Use / When Not

Use:
- Generating free-form text (emails, summaries, creative writing)
- Code completion where you need the next line or block
- Open-ended conversation and question answering

Avoid:
- Filling in missing words in the middle of a sentence
- Tasks requiring bidirectional context (e.g., named entity recognition)
- Exact retrieval from a database or structured query

Common Misconception

Myth: Next token prediction means the model just picks the single most likely word every time, making output deterministic and repetitive. Reality: The model produces a probability distribution over all possible tokens. Sampling strategies — temperature, top-k, top-p — control how much randomness is introduced when selecting from that distribution. Higher temperature makes output more creative; lower temperature makes it more focused. The same prompt can produce different outputs each time.

One Sentence to Remember

Next token prediction is the deceptively simple idea that teaching a model to guess the next word, repeated at scale across billions of examples, is enough to produce language understanding, reasoning, and generation — and it is the core reason decoder-only architectures won the scaling race.

FAQ

Q: Is next token prediction the same as autoregressive generation? A: They are closely related. Next token prediction is the training objective — how the model learns. Autoregressive generation is the inference process that applies that learned ability repeatedly to produce a full response.

Q: Why did this approach win over encoder-decoder models? A: One unified objective means every model parameter serves the same task. Doubling parameters directly improves the one thing the model is trained to do, with no split between encoding and decoding goals.

Q: Can next token prediction handle tasks beyond text, like images or code? A: Yes. Any data that can be broken into sequential tokens — code, music notation, even pixel patches — can be modeled this way. The principle is the same: predict the next element from the preceding ones.

Expert Takes

Next token prediction works because it forces a lossless compression of training data. To predict the next token accurately, the model must build internal representations of syntax, semantics, and even world knowledge. The cross-entropy objective rewards the model for every bit of predictive accuracy it gains, making scaling straightforward — more data and more parameters translate directly into better predictions.

If you are building anything with an LLM — agents, assistants, or retrieval pipelines — next token prediction sets a hard boundary on what the model can do. It generates forward, one token at a time. It cannot go back and revise. That means your prompt design has to front-load the right context, because the model’s entire output quality depends on what it sees before the first generated token.

The entire LLM industry runs on a training objective you can explain in one sentence. That is the strategic insight. Companies that grasped this early — that scaling a simple objective beats engineering a complex one — built the models that dominate today. Decoder-only architectures won not because they are smarter, but because next token prediction made scaling the obvious bet.

There is a tension worth sitting with: next token prediction optimizes for plausibility, not truth. The model learns which tokens are statistically likely to follow other tokens — not whether the resulting sentence is factually correct. Every hallucination is a perfectly confident next token prediction. That gap between fluency and accuracy defines the core reliability challenge we face with these systems.