State Space Models

Also known as: SSM, SSMs, selective state spaces

A class of sequence modeling architectures that process input tokens with linear-time complexity, offering a faster alternative to transformers for handling long sequences by maintaining a compressed hidden state instead of attending to every previous token.

State Space Models (SSMs) are sequence modeling architectures that process input with linear-time complexity O(n), replacing the quadratic scaling of transformer attention with a compressed hidden state that tracks relevant context.

What It Is

Transformers process sequences by letting every token attend to every other token. That approach works brilliantly for short and medium texts, but its cost grows quadratically: double the sequence length and you quadruple the compute. For very long documents, codebases, or audio streams, that quadratic wall makes processing expensive and slow.

State Space Models solve this by borrowing a concept from control theory. Instead of comparing each token against all previous tokens, an SSM maintains a hidden state — a compact summary of everything it has seen so far. Think of it like reading a novel by keeping a running mental summary rather than flipping back to re-read every prior page each time you encounter a new sentence. The hidden state gets updated with each new token, and predictions flow from that compressed representation.

The mechanism works through learned matrices that define how the hidden state evolves: how the previous state influences the next, how new input enters the state, and how the state maps to output. This lets the model process a sequence in a single forward pass that scales linearly: 10,000 tokens cost roughly ten times what 1,000 tokens cost, not a hundred times.
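The three learned matrices described above can be sketched as a toy linear recurrence. This is an illustrative stand-in, not any real model's learned parameters; the matrices A, B, C and the sizes here are arbitrary:

```python
import numpy as np

# Toy discretized state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
# A, B, C are hypothetical stand-ins for learned matrices.
rng = np.random.default_rng(0)
d_state = 4                            # size of the compressed hidden state

A = 0.9 * np.eye(d_state)              # how the previous state carries forward
B = rng.normal(size=(d_state, 1))      # how new input enters the state
C = rng.normal(size=(1, d_state))      # how the state maps to output

def ssm_scan(x):
    """Process a sequence in one linear-time pass, one update per token."""
    h = np.zeros((d_state, 1))         # compressed running summary
    ys = []
    for x_t in x:                      # one state update per token: O(n) total
        h = A @ h + B * x_t            # fold the new token into the summary
        ys.append((C @ h).item())      # read the output from the state alone
    return ys

out = ssm_scan([1.0, 0.5, -0.25])
print(len(out))  # one output per input token
```

Note that the state `h` never grows with sequence length, which is exactly why cost stays linear.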

The most well-known SSM implementation is Mamba, introduced by Albert Gu and Tri Dao in December 2023. According to Gu & Dao, Mamba achieves five times the throughput of comparably sized transformers on long sequences while maintaining competitive quality. The key innovation in Mamba is selective state spaces — the model learns which information to keep in its hidden state and which to discard, rather than applying the same compression rule to every token.
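The selection idea can be illustrated with a simplified toy: an input-dependent gate decides, per token, how much of the old state to keep. This is a conceptual sketch, not Mamba's actual parameterization (which learns input-dependent step sizes and projection matrices):

```python
import numpy as np

# Toy "selective" update: the gate depends on the input, so the model can
# choose to remember or discard information per token. Weights are
# hypothetical placeholders, not trained parameters.
rng = np.random.default_rng(1)
d_state = 4
A = -np.abs(rng.normal(size=d_state))    # decay rates, kept negative
W_gate = rng.normal(size=d_state)        # hypothetical gating weights

def selective_step(h, x_t):
    delta = np.log1p(np.exp(W_gate * x_t))  # input-dependent step (softplus)
    decay = np.exp(delta * A)               # near 1: remember; near 0: forget
    return decay * h + (1 - decay) * x_t    # blend old summary with new input

h = np.zeros(d_state)
for x_t in [2.0, 0.0, -1.0]:
    h = selective_step(h, x_t)
print(h.shape)  # state size stays fixed no matter how long the sequence is
```

The point to notice: when `x_t` pushes `delta` toward zero, `decay` approaches 1 and the token is effectively ignored, which is the "discard" behavior a uniform compression rule cannot express.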

Since then, the architecture has continued to evolve. According to OpenReview, Mamba-3 was published at ICLR 2026 with complex-valued state updates, achieving accuracy gains at the billion-parameter scale over its predecessor while using half the state size.

How It’s Used in Practice

Most people encounter SSMs through hybrid models that combine SSM layers with transformer attention layers. According to AI21 Blog, Jamba is one such hybrid — it blends SSM layers for efficient long-range processing with attention layers that handle tasks requiring precise token-to-token lookup, like retrieval or exact matching. This hybrid approach lets a single model handle both very long contexts and detailed local reasoning without the full quadratic cost.
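The layer mix in such hybrids can be pictured with a small sketch. The layer names and the every-fourth-layer ratio below are hypothetical, chosen only to illustrate the interleaving pattern, not Jamba's actual configuration:

```python
# Hypothetical hybrid stack: mostly SSM layers for linear-time scanning,
# with occasional attention layers for exact token-to-token lookup.
def build_hybrid_stack(n_layers=8, attention_every=4):
    layers = []
    for i in range(n_layers):
        if (i + 1) % attention_every == 0:
            layers.append("attention")   # precise retrieval / exact matching
        else:
            layers.append("ssm")         # cheap long-range context scanning
    return layers

print(build_hybrid_stack())
# ['ssm', 'ssm', 'ssm', 'attention', 'ssm', 'ssm', 'ssm', 'attention']
```

Because only a fraction of layers pay the quadratic attention cost, total compute grows far more slowly with context length than in a pure transformer.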

If you see references to “Mamba-based” or “hybrid SSM” architecture when evaluating an AI tool, it means the model handles longer inputs more cheaply than a pure transformer. For summarizing long documents, processing entire codebases, or running multi-turn conversations with deep history, SSM-hybrid models offer meaningful speed and cost advantages.

Pro Tip: When comparing models for long-context tasks, check whether the model uses a pure SSM, pure transformer, or hybrid architecture. Pure SSMs excel at throughput but may underperform on tasks needing exact recall of a specific detail buried mid-input. Hybrids aim to give you both.

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Processing documents longer than 100K tokens | ✓ | |
| Tasks requiring exact retrieval of a single fact from a large context | | ✓ |
| Real-time streaming (audio, sensor data, live logs) | ✓ | |
| Short prompts under 4K tokens where transformer latency is fine | | ✓ |
| Throughput-sensitive batch processing on long inputs | ✓ | |
| Tasks where the ecosystem of transformer-based tooling matters most | | ✓ |

Common Misconception

Myth: State Space Models will replace transformers entirely because they are faster. Reality: Speed on long sequences is only one dimension. Transformers still outperform pure SSMs on tasks requiring precise attention over specific tokens — like finding a needle in a haystack. That is why the industry is converging on hybrid architectures using SSM layers for efficient scanning and attention layers for exact recall. The question is not “which wins” but “where each layer type belongs.”

One Sentence to Remember

State Space Models trade the transformer’s ability to look at everything at once for a compressed running summary that scales linearly — and hybrid architectures are already combining both approaches to get the best of each.

FAQ

Q: How do State Space Models differ from transformers? A: Transformers compare every token to every other token (quadratic cost). SSMs maintain a compressed hidden state updated with each new token, achieving linear cost but trading off some precise recall ability.
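The cost difference in that answer is easy to make concrete with back-of-envelope operation counts (an idealized model that ignores constant factors):

```python
# Idealized operation counts: attention does work for every token pair,
# an SSM does a fixed amount of work per token.
def attention_ops(n):  # quadratic: n^2 pairwise comparisons
    return n * n

def ssm_ops(n):        # linear: one state update per token
    return n

for n in (1_000, 10_000):
    print(n, attention_ops(n), ssm_ops(n))
# Going from 1K to 10K tokens: attention cost grows 100x, SSM cost only 10x.
```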

Q: Is Mamba a State Space Model? A: Yes. Mamba is the most prominent SSM architecture, introduced in 2023. It adds selective gating so the model learns which information to keep or discard in its hidden state, rather than compressing everything uniformly.

Q: Should I choose an SSM-based model over a transformer model? A: For long-context or high-throughput tasks, SSM-based or hybrid models can be significantly faster and cheaper. For short-context tasks or when exact recall matters most, pure transformers remain strong. Many production models now blend both.


Expert Takes

State Space Models formalize sequence processing as a linear dynamical system with learned transition matrices. The selective mechanism in Mamba is what bridges the gap between classical linear time-invariant systems and the input-dependent gating that transformers achieve through attention. The mathematical elegance is that you get parallelizable training through convolution-mode computation while retaining recurrent-mode efficiency at inference. Hybrid architectures acknowledge that no single inductive bias covers all tasks.
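The convolution-mode/recurrent-mode duality described here can be checked numerically for a toy one-dimensional linear time-invariant system (illustrative scalar parameters, not a trained model):

```python
import numpy as np

# For an LTI SSM, the sequential recurrence and a precomputed convolution
# kernel produce identical outputs. Toy scalar parameters a, b, c.
a, b, c = 0.8, 1.0, 1.0
x = np.array([1.0, 2.0, 0.5, -1.0])

# Recurrent mode: sequential state updates (efficient at inference time).
h, rec = 0.0, []
for x_t in x:
    h = a * h + b * x_t
    rec.append(c * h)

# Convolution mode: the same outputs from the kernel K[k] = c * a^k * b,
# which can be applied in parallel across the sequence during training.
L = len(x)
kernel = np.array([c * (a ** k) * b for k in range(L)])
conv = [np.dot(kernel[:t + 1][::-1], x[:t + 1]) for t in range(L)]

print(np.allclose(rec, conv))  # both modes compute the same outputs
```

Selectivity breaks the time-invariance this trick relies on, which is why Mamba needed a different parallelization strategy (a hardware-aware scan) rather than a plain convolution.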

If you are building a system that ingests long context — full repositories, legal documents, extended conversations — the architecture choice directly affects your cost-per-query and latency budget. Pure transformers at long context lengths burn compute on token pairs that contribute nothing useful. SSM layers handle the bulk scanning, attention layers handle the precision lookups. Spec your architecture the way you would spec a database: pick the right index type for each access pattern.

The commercial signal here is clear: whoever ships reliable long-context processing at lower cost wins enterprise contracts. Hybrid models from companies like AI21 are already positioning SSM integration as a competitive differentiator. The market is moving away from “more attention is always better” toward architectures that match compute allocation to actual task requirements. Teams evaluating model providers should ask about architecture mix, not just parameter counts.

SSMs introduce a different kind of opacity. A transformer’s attention weights at least show you which tokens influenced a decision. An SSM’s hidden state is a compressed black box — information goes in, gets mixed, and you cannot easily trace which earlier input drove a particular output. According to HiSPA Paper, researchers have already demonstrated hidden state poisoning attacks specific to SSMs. As these architectures enter production, the interpretability gap deserves the same scrutiny we give to attention-based models.