Constrained Decoding

Also known as: grammar-constrained generation, structured decoding, token masking

Constrained Decoding
Constrained decoding is an inference-time technique that filters which tokens an LLM can generate at each step using a grammar or schema mask, guaranteeing the output is structurally valid — most often applied to produce reliable JSON from language models.

Constrained decoding is an inference-time mechanism that blocks invalid tokens during LLM generation, ensuring every output conforms to a target grammar or JSON schema without relying on the model’s instruction-following.

What It Is

Language models, even well-prompted ones, can produce invalid JSON — an unclosed bracket, a stray sentence before the opening brace, a field typed as a string when the schema requires a number. These failures are manageable in development but costly in production, where downstream code expects a clean, parseable document. Constrained decoding eliminates this failure mode by enforcing structure at the token level rather than in a post-processing step.

Every time an LLM generates output, it first scores its full vocabulary — all the possible next characters or subwords — and selects the next token based on those scores. Normally, any token in the vocabulary can win that selection. Constrained decoding intercepts the process before that final pick. A validity mask is applied: every token that would violate the target grammar at the current position gets its score set to zero. Only tokens that keep the partial output on a valid path remain eligible.

Think of it as a gate at each position in the output: the model’s scoring tendencies stay intact, but any token that would break the format is simply not available. The model isn’t making different choices — the choices that would produce malformed output don’t exist.

For JSON specifically, the target grammar is derived from a JSON Schema. Libraries like outlines and xgrammar compile that schema into a finite-state automaton — a compact representation of every valid token sequence. At each decoding step, the automaton’s current state tells the system exactly which tokens are legal next. This check is fast, which is why constrained decoding is now practical for production inference without meaningful latency penalties.

This is what separates constrained decoding from structured output prompting. Prompting asks the model to format its response and depends on instruction-following. Constrained decoding removes the option to produce invalid output entirely. The two can work together: prompting shapes the content, while constrained decoding ensures the structure is always valid.

How It’s Used in Practice

The most common scenario: a developer needs to extract structured data from unstructured text — product attributes, entity types, classification labels — and pipe it into a database or API call. Without constrained decoding, the pipeline needs retry logic that catches JSON parse failures and re-prompts the model. With it, every response is schema-valid by construction and the retry loop disappears.

Structured output libraries like outlines and xgrammar handle the integration. You define a JSON Schema or Pydantic model, pass it to the library, and run inference as usual. The constraint applies during sampling, invisibly to the model. Many inference frameworks also support server-side constrained decoding: the schema travels with the request and the server applies the mask before returning each token to the caller.

Pro Tip: Constrained decoding guarantees structural validity, not semantic correctness. A required field will always appear in the output — but the value inside can still be a hallucination or an incorrect extraction. Do not skip content validation just because the schema was enforced at the decoding layer.

When to Use / When Not

ScenarioUseAvoid
Extracting structured data from text for downstream parsing
Output schema changes dynamically per user request at runtime
Generating free-form text — summaries, chat responses
Running inference at scale with a fixed JSON Schema
Caller code parses the response with no fallback logic
The model needs to express uncertainty in natural language

Common Misconception

Myth: Constrained decoding makes the model’s answers more accurate or reliable.

Reality: It only guarantees that the output conforms to the target format. Fields will be present and correctly typed — but their values can still be hallucinations, incorrect extractions, or confident guesses on ambiguous inputs. Structural compliance and factual accuracy are separate concerns. Constrained decoding handles the first; prompt design and output verification handle the second.

One Sentence to Remember

Constrained decoding is a format guarantee, not a fact guarantee — the output will always match the schema, but the values inside are still the model’s best attempt at the truth. For pipelines where a schema-valid but factually wrong response causes downstream damage, pair it with a verification step on the values themselves.

FAQ

Q: Is constrained decoding the same as structured output prompting? A: No. Structured output prompting instructs the model to format its response. Constrained decoding enforces the format at the token sampling layer, making invalid output impossible rather than just unlikely.

Q: Does constrained decoding slow down inference? A: A small overhead is added to compute which tokens are valid at each step. Modern implementations minimize this to near-zero for standard JSON schemas, making the latency cost negligible in most production settings.

Q: Can constrained decoding handle formats other than JSON? A: Yes — any format expressible as a context-free grammar: regex patterns, YAML structures, custom domain-specific languages. It cannot enforce semantic constraints like requiring a date field to contain a future date.

Expert Takes

Each decoding step produces a probability distribution across the model’s full vocabulary. Constrained decoding applies a validity mask derived from parsing the partial output against the target grammar — tokens that would cause a parse failure are set to zero probability. The grammar is typically compiled to a finite-state automaton first, making the per-token check fast enough for production use. Format compliance stops being a statistical outcome and becomes a mathematical certainty.

If downstream code parses an LLM’s JSON response, constrained decoding moves the failure mode earlier and makes it structural. Instead of catching parse errors at runtime, you get a schema-valid document every time the model responds. In practice, this removes the retry-on-parse-failure loop from the application layer. The output contract at the call site becomes something the rest of the system can depend on unconditionally.

Most teams start with structured output prompting because it needs no infrastructure change. Constrained decoding is the upgrade you reach for when prompt-based approaches start failing in production — typically when call volumes grow large enough that even a small parse failure rate creates a manual-intervention backlog. The latency overhead is now minimal enough that the reliability trade-off is nearly always worth it at scale.

Constrained decoding resolves format compliance but introduces its own risk: the validity mask can suppress tokens that would have expressed uncertainty or refusal. A model that cannot produce valid JSON saying “I don’t know” is forced to fill required fields anyway. Format validity and epistemic honesty are separate properties. That gap deserves attention before constrained outputs feed automated decision pipelines where a confident-looking wrong answer is more dangerous than no answer.