XGrammar

Also known as: XGrammar-2, grammar-constrained decoding, constrained generation backend

XGrammar is an open-source library for constrained LLM decoding. It enforces grammar constraints at the token level, blocking invalid tokens during sampling, so that outputs are guaranteed to conform to a JSON Schema or formal grammar with negligible impact on generation latency.

What It Is

Building a structured output pipeline that returns valid JSON every single time takes more than prompt engineering. You can ask a model to produce JSON; you cannot force it. Missing brackets, wrong field types, extra keys: any of these breaks downstream code silently or loudly. Retry logic that fires on parse failures introduces its own problems: latency spikes, chained failures in multi-step agent workflows, and application code that handles a class of error that should not exist at all. XGrammar eliminates that class of error at the source by enforcing structure during generation rather than catching violations afterward.

Think of XGrammar’s enforcement mechanism like a keypad lock on each token slot. Before the model selects the next token, a grammar-derived mask eliminates every choice that would make the output structurally invalid. The model never samples from invalid options — it picks only from valid continuations the grammar allows at that position. Structural correctness becomes a property of the generation process itself, not of the model’s willingness to follow a prompt.
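The masking idea can be sketched in a few lines of plain Python. This is a toy illustration, not XGrammar's implementation: the "grammar" is just a fixed set of allowed strings, the eight-entry vocabulary is invented, and biased_scores is a hypothetical stand-in for a model that strongly prefers an invalid token.

```python
import math

# Toy "grammar": the complete set of strings the output may be (an assumption
# for illustration; a real grammar is far more expressive).
VALID_OUTPUTS = ['{"ok":true}', '{"ok":false}']

# Toy vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ['{', '}', '"ok"', ':', 'true', 'false', 'maybe', '<eos>']


def allowed_mask(prefix):
    """Return one boolean per vocab entry: True if the token keeps the output valid."""
    mask = []
    for token in VOCAB:
        if token == '<eos>':
            # Stopping is only allowed once the output is complete.
            mask.append(prefix in VALID_OUTPUTS)
        else:
            candidate = prefix + token
            mask.append(any(v.startswith(candidate) for v in VALID_OUTPUTS))
    return mask


def constrained_greedy_decode(score_fn, max_steps=16):
    """Greedy decoding where every masked-out token gets a score of -inf."""
    prefix = ''
    for _ in range(max_steps):
        scores = score_fn(prefix)
        mask = allowed_mask(prefix)
        masked = [s if ok else -math.inf for s, ok in zip(scores, mask)]
        best = max(range(len(VOCAB)), key=lambda i: masked[i])
        if VOCAB[best] == '<eos>':
            break
        prefix += VOCAB[best]
    return prefix


def biased_scores(prefix):
    """Hypothetical model scores that always favor the invalid token 'maybe'."""
    return [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 9.0, 0.0]


result = constrained_greedy_decode(biased_scores)
print(result)
```

Even though the stand-in model scores 'maybe' highest at every step, that token is never selectable, so the decoded string is always one of the allowed outputs.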

According to MLC AI Blog, XGrammar-2 (released May 2026) achieves a roughly 80× speedup in grammar compilation compared to its predecessor, with constraint processing under 40 microseconds per token. At those speeds, grammar enforcement adds negligible overhead to generation time — the model inference itself remains the bottleneck, not the mask computation.

XGrammar-2 also introduces a domain-specific language called “Structural Tag” — a JSON-based format for defining composable agent output structures. Rather than writing a bare JSON Schema, developers can specify tagged sequences, named variants, and nested schemas in a single declarative spec. This matters in multi-step agent pipelines where the expected output shape needs to vary by task context without requiring separate schema definitions per step. According to XGrammar GitHub, the library supports bindings for Python, C++, Rust, and JavaScript, and is released under the Apache-2.0 license — making it possible to integrate constrained decoding across diverse serving environments.

How It’s Used in Practice

Most developers encounter XGrammar not directly, but as the inference backend that makes tools like Instructor or BAML produce reliable output. When you define an output schema in an Instructor BaseModel or a BAML function signature and send a request to a vLLM or SGLang inference server, XGrammar is what enforces that schema at the token level. According to MLC AI Blog, it is the default structured generation backend for vLLM, SGLang, TensorRT-LLM, and MLC-LLM.

For structured output pipelines, this shifts the guarantee to the right layer. Instead of wrapping every API call in try/except and implementing retry logic for parse failures, the pipeline receives a structurally valid response on the first attempt. The schema validation logic in your application code becomes about checking values — not checking whether the JSON parsed at all.
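A minimal sketch of what value-level checking might look like once structure is guaranteed. The order schema, its quantity and currency fields, and the check_values helper are all hypothetical, chosen only to illustrate the split between structural and semantic validation.

```python
import json


def check_values(order):
    """Return a list of semantic problems; the structure is assumed valid already."""
    problems = []
    if order["quantity"] <= 0:
        problems.append("quantity must be positive")
    if order["currency"] not in {"USD", "EUR"}:
        problems.append("unsupported currency")
    return problems


# A response that satisfies the (hypothetical) schema but carries a bad value.
raw = '{"quantity": -3, "currency": "USD"}'
order = json.loads(raw)
problems = check_values(order)
print(problems)
```

The parse step is no longer where failures are expected; the checks that remain are about whether the values make sense for the application.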

Pro Tip: If your structured output pipeline runs on vLLM or SGLang as the inference server, XGrammar is likely already active. Check the server startup logs or configuration for guided_decoding_backend: xgrammar. For local development without a full inference server, install the xgrammar Python package directly and compile the grammar from your JSON Schema once, before inference begins. The token mask is still computed at each decoding step, but reusing the compiled grammar avoids recompiling it on every request.
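The compile-once pattern can be sketched with a small cache. The compile_schema function below is a hypothetical stand-in for an expensive compilation step, not the xgrammar API; only the caching idea carries over.

```python
import json
from functools import lru_cache

compile_calls = 0  # counts how often the expensive step actually runs


@lru_cache(maxsize=128)
def compile_schema(schema_text):
    """Hypothetical stand-in for an expensive grammar compilation step."""
    global compile_calls
    compile_calls += 1
    schema = json.loads(schema_text)
    # A real compiler would build matcher state; here we keep only the field names.
    return tuple(sorted(schema.get("properties", {})))


def compiled_for(schema):
    """Serialize with sorted keys so equal schemas share one cache entry."""
    return compile_schema(json.dumps(schema, sort_keys=True))


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}

first = compiled_for(schema)   # runs the compilation
second = compiled_for(schema)  # served from the cache
print(compile_calls)
```

Keying the cache on a canonical serialization of the schema means two requests with the same schema share one compiled result, regardless of dictionary key order.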

When to Use / When Not

Scenario	Use	Avoid
Downstream code breaks when JSON has missing or unexpected fields	✅
Agent loop where one malformed output halts the entire chain	✅
Free-text generation (summaries, explanations, creative content)		❌
Streaming output that must be progressively parseable at each token	✅
Inference via a hosted API that does not support guided decoding		❌

Common Misconception

Myth: XGrammar adds significant latency because it processes grammar constraints on every generated token. Reality: The overhead is negligible. According to MLC AI Blog, XGrammar-2 processes constraints in under 40 microseconds per token — orders of magnitude below the time the model itself takes to generate each token. Generation time is dominated by model inference, not grammar enforcement.
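To put the quoted figure in proportion, here is a back-of-the-envelope calculation. The 20 ms per-token decode time is an assumed value for illustration only; substitute a measurement from your own deployment.

```python
MASK_OVERHEAD_S = 40e-6    # upper bound per token, from the figure quoted above
ASSUMED_DECODE_S = 20e-3   # hypothetical per-token model latency; measure your own

# Share of each decoding step spent on grammar enforcement.
overhead_fraction = MASK_OVERHEAD_S / ASSUMED_DECODE_S

# Total added time for a 500-token response, in milliseconds.
extra_ms_for_500_tokens = 500 * MASK_OVERHEAD_S * 1000

print(f"{overhead_fraction:.2%} per token, {extra_ms_for_500_tokens:.0f} ms per 500 tokens")
```

Under that assumption the mask costs about 0.2 percent of each step, or roughly 20 ms across a 500-token response that takes about 10 seconds to generate.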

One Sentence to Remember

XGrammar moves the guarantee of structurally correct output from your retry logic to the generation step itself — meaning invalid JSON stops being a failure mode your pipeline needs to handle and becomes something that simply does not occur.

FAQ

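Q: Does XGrammar work with any LLM, or only specific models? A: XGrammar operates in the decoding loop, not inside the model, so what matters is where the model runs rather than which model it is. It requires either a compatible serving framework (vLLM, SGLang, TensorRT-LLM, or MLC-LLM) or the xgrammar package in inference code you control. Direct integration with hosted APIs like OpenAI or Anthropic is not supported, because those APIs do not expose the token sampling step.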

Q: Is XGrammar the same as enabling JSON mode in an LLM API? A: No. JSON mode asks the model to produce JSON — it does not guarantee it. XGrammar enforces valid tokens at each decoding step, so structural correctness is a property of the generation process, not a request directed at the model.

Q: What output formats does XGrammar support beyond JSON Schema? A: According to MLC AI Blog, XGrammar supports JSON Schema, regular expressions, EBNF grammars (Extended Backus-Naur Form — a standard grammar notation language), and the Structural Tag DSL introduced in XGrammar-2 for composable agent output structures.

Sources

MLC AI Blog: XGrammar-2: Fast and Customizable Structured Generation - primary reference for XGrammar-2 performance, features, and backend integration status
XGrammar GitHub: mlc-ai/xgrammar — Fast, Flexible and Portable Structured Generation - source code, language bindings, and Apache-2.0 license

Expert Takes

MONA

Constrained decoding is the correct approach to structured output — not probabilistic coaxing through prompt design. A token mask converts a grammar into hard constraints at sampling time, making syntactic correctness a structural property of generation. Most invalid outputs fail at the token boundary, not somewhere deep in semantic space. Enforcing constraints there costs far less than retrying full inferences. Encode the invariant, not the preference.

MAX

A structured output pipeline breaks in two places: the model ignores your schema, or your parser chokes on the response. XGrammar removes the first failure mode entirely. Enforcement at the token level means the schema in your Instructor model or BAML function signature becomes a hard contract, not a request. Your integration code can skip defensive JSON parsing and go straight to business logic. The inference server handles correctness; you handle what to do with valid data.

DAN

The “just retry if the JSON breaks” approach has a hidden cost — latency spikes and chained failures when retries pile up. Production pipelines that depend on structured LLM output need a different contract: not best-effort JSON, but guaranteed JSON. That contract now exists in the inference layer, not as a wrapper library. Teams that wire this in early eliminate a whole category of runtime errors before they hit production. That’s not a nice-to-have. That’s table stakes for anything shipping.

ALAN

Guaranteed valid JSON sounds like a purely technical win — and technically, it is. But the reliability it creates shifts accountability. When invalid outputs disappear as a failure mode, remaining failures become harder to dismiss as “the model was unpredictable.” Constrained decoding does not solve the problem of wrong-but-valid answers: a well-formed JSON object with plausible-sounding hallucinated values passes every schema check. Infrastructure reliability can create a false sense of correctness. Valid structure is not the same as accurate content.