Jamba

Also known as: Jamba 1.5, Jamba Hybrid, AI21 Jamba

Jamba is AI21 Labs’ family of hybrid language models that interleave Transformer attention layers with Mamba state-space layers and Mixture-of-Experts routing, giving them Transformer-grade recall at a fraction of the attention cost for long-context workloads.

What It Is

Transformers handle recall well but pay a quadratic cost as context grows, which quickly becomes the ceiling for whole-document or multi-document workloads. Pure state-space models like Mamba are cheap at long lengths but compress history aggressively, which can cost precision on tasks that require pulling an exact line out of a long passage. Jamba exists to split the difference: keep attention where it earns its cost, and push everything else to a cheaper sequence processor.
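The cost gap above can be made concrete with back-of-envelope arithmetic. This is a sketch that ignores constants and the SSM's internal state size; the hidden size is an illustrative assumption, not a Jamba configuration:

```python
def attention_cost(n, d):
    """Pairwise score matrix: every token attends to every token -> O(n^2 * d)."""
    return n * n * d

def ssm_cost(n, d):
    """A state-space scan touches each token once -> O(n * d)."""
    return n * d

d = 4096  # hypothetical hidden size, for illustration only
for n in (4_000, 64_000, 256_000):
    ratio = attention_cost(n, d) / ssm_cost(n, d)
    print(f"context {n:>7,} tokens: attention / SSM cost ratio = {ratio:,.0f}x")
```

The ratio is simply the context length itself, which is why the gap that is tolerable at a few thousand tokens becomes the dominant cost at 256K.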

The architecture is defined one block at a time. A useful mental picture: imagine a team reading a long document, where most members keep a running summary and a few can flip back to the original pages when the summary falls short — Mamba layers play the summarizer role, attention layers play the flip-back role. Each Jamba block contains either an attention layer or a Mamba layer, followed by an MLP, and the blocks are arranged in a fixed pattern. According to Lieber, Lenz et al., the pattern places one attention layer for every eight total layers, so the model gets sparse attention rather than attention everywhere. A subset of the MLPs is replaced with Mixture-of-Experts layers, which route each token through only a small fraction of the network’s parameters even though the full parameter count is much larger.
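As a sketch, that one-attention-per-eight-layers pattern can be written as a layer schedule. The exact attention positions and the MoE cadence below are illustrative assumptions, not the paper's precise layout:

```python
def jamba_schedule(n_layers=32, attn_every=8, moe_every=2):
    """Sketch of a Jamba-style layer schedule (positions are illustrative).

    One attention layer per `attn_every` total layers; every `moe_every`-th
    MLP slot is replaced by a Mixture-of-Experts layer.
    """
    layers = []
    for i in range(n_layers):
        # Sparse attention: one attention mixer per group of `attn_every` layers.
        mixer = "attention" if i % attn_every == attn_every // 2 else "mamba"
        # Alternate plain MLPs with MoE layers.
        mlp = "moe" if i % moe_every == 1 else "mlp"
        layers.append((mixer, mlp))
    return layers

schedule = jamba_schedule()
attention_layers = sum(mixer == "attention" for mixer, _ in schedule)
print(f"{attention_layers} attention layers out of {len(schedule)} total")
```

With 32 layers this yields 4 attention layers, matching the 1-in-8 ratio; everything else runs through the cheaper Mamba path.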

The current series is Jamba 1.5. According to AI21 Research, it ships in two sizes — Jamba 1.5 Mini with 52B total and 12B active parameters, and Jamba 1.5 Large with 398B total and 94B active — and both support a 256K-token context window. The paper was accepted at ICLR 2025, and AI21 releases the models under its Jamba Open Model License for research and many commercial uses.
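In parameter terms, the MoE routing means roughly a quarter of each model's weights are active per token. A quick check using the published figures:

```python
# Active vs. total parameters for the two published Jamba 1.5 sizes.
models = {
    "Jamba 1.5 Mini":  {"total_b": 52,  "active_b": 12},
    "Jamba 1.5 Large": {"total_b": 398, "active_b": 94},
}

for name, p in models.items():
    fraction = p["active_b"] / p["total_b"]
    print(f"{name}: {p['active_b']}B of {p['total_b']}B active (~{fraction:.0%} per token)")
```

That gap between total and active parameters is what lets the Large model store frontier-scale knowledge while pricing inference closer to a ~94B dense model.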

How It’s Used in Practice

Product teams reach for Jamba when the workload is dominated by long inputs rather than long chat turns. Typical scenarios are analyzing full-length legal contracts or regulatory filings, running retrieval-augmented generation over large enterprise knowledge bases without aggressive chunking, processing long meeting or interview transcripts, and driving agentic workflows that accumulate many turns of tool calls and observations before a final answer. In each case, the bottleneck on a pure transformer is either the context window or the latency and cost of paying full attention across it.

The deployment pattern is usually “evaluate against your current transformer on your own long-context data” rather than a blind swap. Teams compare recall on needle-in-a-haystack tasks and end-to-end quality at realistic context lengths, then factor in latency and per-token cost. The right choice depends on which tradeoff hurts your workload.
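A minimal needle-in-a-haystack harness along those lines might look like the following. The `generate` callable stands in for whichever model API you are comparing; its prompt-to-string signature is a placeholder assumption:

```python
def make_needle_prompt(needle: str, filler_lines: list[str], depth: float) -> str:
    """Bury a single fact at a chosen relative depth inside long filler text."""
    pos = int(len(filler_lines) * depth)
    lines = filler_lines[:pos] + [needle] + filler_lines[pos:]
    return "\n".join(lines)

def needle_recall(generate, needle_fact, question, answer, filler, depths):
    """Fraction of insertion depths at which the model's answer contains the target.

    `generate` is your model call (hypothetical signature: prompt -> str).
    """
    hits = 0
    for depth in depths:
        prompt = make_needle_prompt(needle_fact, filler, depth) + "\n\n" + question
        if answer.lower() in generate(prompt).lower():
            hits += 1
    return hits / len(depths)
```

Run the same harness at several context lengths against both candidates, since recall curves often diverge only near each model's effective limit.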

Pro Tip: Before you build a Jamba integration into your stack, run your own long-context evaluation on the specific retrieval patterns your product depends on — hybrid models can win decisively on some and lose surprisingly on others, and the published benchmarks rarely look exactly like your workload.

When to Use / When Not

Use:
- Long-document analysis that blows past your current transformer’s context window
- Cost-sensitive RAG over large enterprise corpora with full-document prompts
- Agentic pipelines that accumulate many turns of tool calls and observations

Avoid:
- Short-prompt conversational chat with a few thousand tokens of context
- Tasks requiring pixel-perfect token-by-token recall of every detail in the input
- Projects that need the broadest possible tooling and fine-tuning ecosystem today

Common Misconception

Myth: Jamba replaces Transformer attention with Mamba, so it is essentially a “pure SSM” model with a different name. Reality: Jamba keeps attention in every block group — it just makes attention sparse, appearing in roughly one layer out of every eight. The model is deliberately hybrid, and AI21’s own ablations show that removing the remaining attention layers noticeably hurts recall. The whole point is the mix, not the replacement.

One Sentence to Remember

Jamba is the early production proof that you do not have to choose between attention’s precision and state-space efficiency — interleave them deliberately, add Mixture-of-Experts for parameter scale, and you get a model that handles book-length context without paying attention’s full quadratic bill.

FAQ

Q: How is Jamba different from Mamba? A: Mamba is a pure state-space architecture. Jamba is a family of production models that combine Mamba layers with Transformer attention layers and Mixture-of-Experts routing inside each block, rather than replacing attention outright.

Q: What context window does Jamba support? A: According to AI21 Research, both Jamba 1.5 Mini and Jamba 1.5 Large support a 256K-token context window, among the largest in open-weight production-grade language models.

Q: Is Jamba open source? A: AI21 releases Jamba under its Jamba Open Model License, which covers research and many commercial uses but includes some restrictions. Check the current license text before deploying Jamba 1.5 Large in a production product.

Expert Takes

Not a Mamba replacement for attention. A composition of two layer types with complementary failure modes. Attention pays the full cost for everything it remembers; selective state-space layers compress aggressively but drop hard-to-anticipate detail. A hybrid block exploits that complementarity — a few attention layers act as sparse lookups over the compressed state, letting the model recover what the state-space layer had to discard. That is architectural composition, not ensemble magic.

If you are writing a context file for a Jamba-based agent, plan for different retrieval costs than you would for a pure transformer. Attention happens only at certain layers, so stuffing identical boilerplate across a giant prompt wastes tokens without the compensating recall. Structure your context as distinct chunks with clear boundaries, and put the facts the agent actually needs to recall near section markers the model can latch onto. Write the spec accordingly.

The pure-transformer era is ending. Jamba is one of the early signals, not the final word. The production question is no longer “is attention better than state-space?” — it is “which hybrid recipe wins at scale?” Expect every frontier lab to ship some variant of interleaved attention and linear-time layers in the next generation. Teams still planning long-context roadmaps on pure transformer scaling laws are planning for a world that already shifted under them.

Who bears responsibility when a hybrid model silently discards context a pure transformer would have kept? State-space layers compress by design, and that compression is not transparent to the user. If a long-context assistant drops a consent clause, a policy exception, or a safety-relevant instruction because the state-space layer decided it was not worth preserving, the user sees no error — just confident output. Architectural choices like this are governance choices.