Speculative Decoding

Also known as: spec decoding, assisted generation, draft-verify decoding

Speculative Decoding
An inference acceleration technique where a small draft model proposes multiple candidate tokens that a larger target model verifies in parallel, reducing latency while preserving output quality identical to standard generation.


What It Is

Large language models generate text one token at a time. Each token requires a full forward pass through billions of parameters, and most of that compute is spent waiting — the GPU reads model weights from memory, performs a calculation, then repeats the whole cycle for the next token. This sequential bottleneck is a core part of the memory wall problem that limits LLM inference speed, regardless of hardware power.

Speculative decoding tackles this bottleneck with a simple idea: let a smaller, faster model do the guessing, and let the big model check the guesses in bulk.

Think of a senior editor reviewing articles. Instead of writing every sentence, they hand the task to a junior writer who quickly drafts several paragraphs. The editor reviews them all at once — accepting what fits, correcting the rest. Same quality, faster throughput, because reviewing is cheaper than writing from scratch.

In technical terms, the process works in three steps. First, a lightweight draft model (often tens of millions of parameters versus billions in the target) generates a batch of candidate tokens one after another — this is fast because the draft model is small. Second, the target model processes all candidate tokens in a single pass, computing the probability it would have assigned to each one. Third, a statistical check called rejection sampling compares the draft and target probability distributions token by token: tokens where the draft matches what the target would have chosen get accepted, and the first mismatched token gets resampled from an adjusted distribution.
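
The three steps above can be sketched end to end. The snippet below is a minimal toy, assuming stand-in draft_probs and target_probs functions over a four-token vocabulary (hypothetical interfaces for illustration, not a real inference API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two models: each returns a next-token distribution
# given the tokens generated so far. Hypothetical, for illustration only.
def draft_probs(tokens):
    return np.array([0.4, 0.3, 0.2, 0.1])   # small model: fast, approximate

def target_probs(tokens):
    return np.array([0.3, 0.3, 0.2, 0.2])   # large model: authoritative

def speculative_step(gamma=4):
    """One draft-verify round; returns the tokens actually emitted."""
    # Step 1: the draft model proposes gamma tokens autoregressively (cheap).
    proposals, q = [], []
    for _ in range(gamma):
        dist = draft_probs(proposals)
        proposals.append(int(rng.choice(len(dist), p=dist)))
        q.append(dist)
    # Step 2: the target model scores every proposal prefix in one
    # (simulated) forward pass.
    p = [target_probs(proposals[:i]) for i in range(gamma)]
    # Step 3: rejection sampling, token by token.
    emitted = []
    for i, tok in enumerate(proposals):
        if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
            emitted.append(tok)              # draft agreed with the target
        else:
            # First mismatch: resample from the normalized residual
            # max(0, p - q), then discard the remaining drafts.
            residual = np.maximum(p[i] - q[i], 0.0)
            emitted.append(int(rng.choice(len(residual),
                                          p=residual / residual.sum())))
            break
    return emitted

tokens = speculative_step()
```

In a real system the draft loop is cheap because the draft model is small, and Step 2 amortizes the expensive target pass over all gamma candidates at once.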

The mathematical guarantee here matters. According to Google Research, speculative decoding produces an output distribution identical to standard autoregressive decoding (the default one-token-at-a-time method). The big model’s quality is fully preserved — you get the same answers faster, not different answers.
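
The preservation argument can be written out in a few lines. With draft distribution q and target distribution p, a proposed token x is accepted with probability min(1, p(x)/q(x)); on rejection, a replacement is drawn from the normalized residual. The marginal probability of emitting x is then:

```latex
P(X = x)
  = \underbrace{q(x)\,\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big)}_{\text{accepted}}
  + \underbrace{\Big(1 - \textstyle\sum_y \min\big(p(y), q(y)\big)\Big)\,
      \frac{\max\big(0,\, p(x) - q(x)\big)}
           {\sum_y \max\big(0,\, p(y) - q(y)\big)}}_{\text{resampled}}
```

Since q(x) min(1, p(x)/q(x)) = min(p(x), q(x)) and the residual masses satisfy Σ_y max(0, p(y) − q(y)) = 1 − Σ_y min(p(y), q(y)), the two terms collapse to min(p(x), q(x)) + max(0, p(x) − q(x)) = p(x): exactly the target distribution.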

The original technique, published by Leviathan, Kalman, and Matias at Google Research in 2022, demonstrated speedups of roughly two to three times. Since then, variants have evolved. According to NVIDIA Blog, EAGLE-3 attaches a lightweight prediction head directly to the target model’s layers, removing the need for a separate draft model entirely.

How It’s Used in Practice

If you use an AI assistant that responds noticeably faster than it did a year ago — whether that’s a coding tool, a chatbot, or AI-powered search — speculative decoding is likely part of the reason. The technique runs behind the scenes in inference servers. Major serving frameworks like vLLM, SGLang, and TensorRT-LLM all support it as a built-in optimization. Google deploys it in production for AI Overviews in Search, where response latency directly affects user experience.

For teams running their own models, speculative decoding stacks with other optimizations. According to Prem AI Blog, combining it with FP8 quantization on compatible hardware produced a total speedup exceeding three times on a large Llama model — showing how it fits into a broader toolkit alongside continuous batching and hardware-specific kernels.

Pro Tip: If you’re evaluating an inference provider or self-hosting, ask whether speculative decoding is enabled. It’s often just a configuration flag — turning it on can cut time-to-first-token and generation latency with zero impact on output quality.

When to Use / When Not

Use it for:
- Latency-sensitive applications (chatbots, search, coding assistants)
- Large target models where a compatible draft model or prediction head exists
- Stacking with quantization for maximum inference speed gains
- Deployments where the output distribution must stay mathematically identical

Avoid it for:
- Batch processing focused on throughput where latency is irrelevant
- Small models that already generate tokens quickly enough

Common Misconception

Myth: Speculative decoding trades output quality for speed — the answers are approximate or degraded. Reality: The rejection sampling step guarantees that accepted tokens follow the exact same probability distribution as standard autoregressive decoding. The target model has final say on every token. You get identical quality at lower latency, not a speed-quality tradeoff.
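
The guarantee is also easy to check empirically. The sketch below uses toy distributions (illustrative only): it samples one token many times via draft-then-verify and the empirical frequencies converge to the target distribution p, not to the draft's q:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy next-token distributions over a 3-token vocabulary (illustrative only).
q = np.array([0.6, 0.3, 0.1])   # draft model's proposal distribution
p = np.array([0.2, 0.5, 0.3])   # target model's distribution

def one_token():
    """Draft proposes, target verifies via rejection sampling."""
    x = rng.choice(3, p=q)                     # draft's guess
    if rng.random() < min(1.0, p[x] / q[x]):   # target's accept test
        return x
    residual = np.maximum(p - q, 0.0)          # adjusted distribution
    return rng.choice(3, p=residual / residual.sum())

samples = np.array([one_token() for _ in range(100_000)])
empirical = np.bincount(samples, minlength=3) / len(samples)
# empirical tracks p (the target), not q (the draft)
```

Even though the draft heavily favors token 0, the verification step steers the output back to the target's preferences.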

One Sentence to Remember

Speculative decoding lets a small model draft and a large model verify, cutting generation time without changing a single word of the output — making it one of the most practical techniques for breaking through the inference latency barriers that memory walls and sequential token generation create.

FAQ

Q: Does speculative decoding change the quality of the model’s output? A: No. A rejection sampling algorithm ensures the output probability distribution stays mathematically identical to standard autoregressive decoding. Quality is fully preserved — only generation latency decreases.

Q: Do I need a separate draft model to use speculative decoding? A: Not always. Newer variants like EAGLE-3 attach a lightweight prediction head to the target model itself, eliminating the need for a standalone draft model while keeping the same verification guarantees.

Q: How much faster does speculative decoding make inference? A: Typical speedups range from two to three times faster generation latency, depending on the draft model’s accuracy and the task. Results improve further when combined with quantization.


Expert Takes

Speculative decoding preserves the target model’s distribution through a rejection sampling proof — every accepted token passes a mathematical test against the target’s own probabilities. The draft model proposes; it never decides. This distinction matters: unlike quantization or pruning, speculative decoding introduces zero approximation error. The output is not “close enough.” It is provably identical to what the target model would have generated alone.

In a serving stack, speculative decoding sits between your model weights and your latency target. The practical win is minimal integration overhead — most major frameworks expose it as a configuration option, not an architectural overhaul. Pair it with continuous batching and quantization, and you address three distinct bottlenecks: per-token latency, throughput under concurrency, and memory footprint. Each optimization targets a different constraint in the inference pipeline.

Inference cost determines who ships AI products and who just demos them. Speculative decoding cuts time-to-first-token and total generation latency without touching model quality — that reshapes the economics of serving large models at scale. Every provider competing on response speed runs some variant of this technique. If your serving stack doesn’t include it yet, you’re spending more for slower responses than your competitors.

The guarantee of identical output distributions is reassuring, but it also clarifies what speculative decoding does not address. Faster generation amplifies whatever biases and failure modes the target model already carries. Speed improvements in inference are real engineering progress — they also mean more content produced faster with the same unresolved safety gaps. Acceleration without matching investment in evaluation widens the gap between what we can generate and what we can verify.