MONA · Explainer · 10 min read · Published April 17, 2026 · Updated July 3, 2026

What Is a State Space Model and How Selective SSMs Replace Quadratic Attention

ELI5

A state space model processes a sequence by carrying a compressed hidden state forward, updated token by token. Selective SSMs — Mamba and successors — decide per input what to keep or forget, running in linear time instead of quadratic.

Double a transformer’s context window, and the cost of its attention layers roughly quadruples. Double it again, and it quadruples again. By the time frontier labs were chasing million-token context, everyone was staring at the same bill, and the math said the only way out was a different kind of model. The escape hatch didn’t come from a cleverer variant of attention. It came from an idea borrowed from linear systems theory, fitted with enough new mechanics to finally train well at scale.

The Idea Attention Was Competing Against

Before selection mechanisms turned SSMs into transformer rivals, the architecture had been sitting in signal processing textbooks for decades. To see why it now works in a language model, you have to see what it is at the level of the recurrence — and why that shape is fundamentally different from attention.

What is a state space model in AI?

A state space model represents a sequence as a continuous dynamical system discretized onto tokens. Instead of computing pairwise interactions between every token and every other token, it maintains a hidden state h that evolves step by step. The update rule is compact:

h_t = A h_{t-1} + B x_t, and y_t = C h_t.

The matrices A, B, and C govern how the past compresses into the present and how the present projects to the output. Every token updates the same hidden state — no token-to-token attention required.
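A minimal NumPy sketch of that recurrence may help make it concrete. The matrices and the input below are toy values chosen for illustration, not parameters from any trained model, and `ssm_scan` is a hypothetical helper name.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run h_t = A h_{t-1} + B x_t and y_t = C h_t over a scalar input sequence."""
    h = np.zeros(A.shape[0])      # hidden state starts at zero
    ys = []
    for x_t in x:                 # exactly one state update per token
        h = A @ h + B * x_t       # fold the new input into the state
        ys.append(C @ h)          # read the output from the state
    return np.array(ys), h

# Toy example: 2-dimensional state, scalar input and output (illustrative values).
A = np.array([[0.9, 0.0],
              [0.0, 0.5]])
B = np.array([1.0, 1.0])
C = np.array([1.0, -1.0])
x = np.array([1.0, 0.0, 0.0, 2.0])

y, h_final = ssm_scan(A, B, C, x)
```

Here `y` comes out as approximately `[0.0, 0.4, 0.56, 0.604]`. The state stays two numbers wide no matter how long `x` gets, which is the whole point: the cost per token is constant.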

The formulation isn’t new. It was introduced to deep learning by the HiPPO framework and then made trainable at scale by S4, whose normal-plus-low-rank (NPLR) parameterization of A kept very long sequences computationally tractable (S4 paper).

Not storage. Compression.

The bottleneck in the original S4 line was different from attention’s. Attention can see every past token, at quadratic cost. SSMs could run in linear time, but every token updated the hidden state the same way. They compressed information faithfully — they just couldn’t choose what to compress. A useful fact and a filler token went through the same A, the same B. That uniformity was the ceiling.

When the Input Steers Its Own Dynamics

The fix was simple in retrospect. Make the model’s dynamics depend on what it’s reading. Not fixed transitions applied uniformly — input-dependent transitions, token by token.

How does a selective state space model process sequences in linear time?

In a selective SSM, the parameters that govern how the hidden state evolves — Δ (the discretization step), B (the input projection), and C (the output projection) — become functions of the current input instead of fixed weights, per Gu & Dao (Mamba paper). The model reads each token, computes what portion of that token to let into the state, and modulates how quickly older information fades. The effect is a per-token forget gate and a per-token input gate, baked directly into the recurrence.
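The selection step can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the reference implementation: A is diagonal per channel, the projection weights `W_delta`, `W_B`, and `W_C` are hypothetical names, and the discretization uses exp(ΔA) for the decay and ΔB for the input term.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)   # log(1 + e^z), keeps the step size positive

def selective_ssm(x, A, W_delta, W_B, W_C):
    """x: (L, D) inputs. A: (D, N) negative decay rates, diagonal per channel.
    W_delta: (D, D), W_B: (D, N), W_C: (D, N) are hypothetical projection weights."""
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))
    ys = np.zeros((L, D))
    for t in range(L):
        delta = softplus(x[t] @ W_delta)        # (D,) per-token step size
        B_t = x[t] @ W_B                        # (N,) per-token input projection
        C_t = x[t] @ W_C                        # (N,) per-token output projection
        A_bar = np.exp(delta[:, None] * A)      # (D, N) decay in (0, 1): acts like a forget gate
        B_bar = delta[:, None] * B_t[None, :]   # (D, N) how much of the input to admit
        h = A_bar * h + B_bar * x[t][:, None]   # input-dependent state update
        ys[t] = h @ C_t                         # (D,) output for this token
    return ys, h

rng = np.random.default_rng(0)
L_len, D, N = 6, 3, 4                           # illustrative sizes
x = rng.standard_normal((L_len, D))
A = -np.exp(rng.standard_normal((D, N)))        # negative, so the state decays
W_delta = rng.standard_normal((D, D))
W_B = rng.standard_normal((D, N))
W_C = rng.standard_normal((D, N))

ys, h = selective_ssm(x, A, W_delta, W_B, W_C)
```

A small Δ pushes `A_bar` toward 1 and `B_bar` toward 0, so the token barely touches the state. A large Δ does the opposite: old state fades and the new token dominates.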

That alone wouldn’t be fast. Input-dependent dynamics break the convolutional shortcut that made S4 practical to train. The recurrence becomes genuinely sequential — and sequential is the enemy of GPUs.
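To see what the convolutional shortcut is, and why it needs fixed parameters, consider this sketch. With constant A, B, and C, the recurrence unrolls into a single kernel K[k] = C A^k B that can be applied to the whole sequence at once. The helper names are hypothetical, and `np.convolve` stands in for the FFT-based convolution a real implementation would likely use.

```python
import numpy as np

def ssm_recurrent(A, B, C, x):
    """Step-by-step reference: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_kernel(A, B, C, length):
    """K[k] = C A^k B: the contribution of an input k steps back to the current output."""
    K = []
    v = B.copy()
    for _ in range(length):
        K.append(C @ v)
        v = A @ v
    return np.array(K)

def ssm_convolutional(A, B, C, x):
    K = ssm_kernel(A, B, C, len(x))
    # Causal convolution: y_t = sum over k of K[k] * x[t - k]
    return np.convolve(x, K)[:len(x)]

rng = np.random.default_rng(1)
A = 0.3 * rng.standard_normal((4, 4))   # illustrative fixed parameters
B = rng.standard_normal(4)
C = rng.standard_normal(4)
x = rng.standard_normal(16)

same = np.allclose(ssm_recurrent(A, B, C, x), ssm_convolutional(A, B, C, x))
```

The kernel only exists because the same A, B, and C apply at every position. Once they vary per token, there is no single K to precompute, and the shortcut is gone.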

Mamba’s second trick is a hardware-aware parallel associative scan. Per Gu & Dao (Mamba paper), it fuses parameter loading, discretization, and the recurrence itself into a single kernel that minimizes traffic between HBM and SRAM, and runs an order of magnitude faster than a naïve scan. The recurrence is still sequential in principle; on the hardware, it’s parallelized along the sequence dimension by exploiting the associativity of the update operator.
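The associativity argument can be checked directly. Each step h -> a*h + b is an affine map, and composing two such maps yields another affine map, so prefixes can be combined in any grouping. The sketch below uses a simple Hillis-Steele scan on scalar states for illustration only; it is not the fused kernel described in the paper, and the function names are assumptions.

```python
import numpy as np

def combine(left, right):
    """Compose two steps h -> a*h + b, where `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return a1 * a2, a2 * b1 + b2

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h starting at zero."""
    h = 0.0
    out = []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return np.array(out)

def parallel_scan(a, b):
    """Hillis-Steele inclusive scan: about log2(n) rounds, each fully elementwise."""
    a = np.asarray(a, dtype=float).copy()
    b = np.asarray(b, dtype=float).copy()
    n = len(a)
    shift = 1
    while shift < n:
        # Position t absorbs the partial result that ends at position t - shift.
        new_a, new_b = combine((a[:-shift], b[:-shift]), (a[shift:], b[shift:]))
        a[shift:], b[shift:] = new_a, new_b
        shift *= 2
    return b   # with a zero initial state, the offset term is h_t

rng = np.random.default_rng(2)
a = rng.uniform(0.1, 0.99, size=37)   # per-token decay, as in a selective SSM
b = rng.standard_normal(37)           # per-token input contribution

matches = np.allclose(sequential_scan(a, b), parallel_scan(a, b))
```

Each round of the loop is a batch of independent elementwise operations, which is the kind of work a GPU handles well. The per-token `a` and `b` can differ at every position, so this works even when the dynamics are input-dependent.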

The analogy that makes this concrete: a transformer is a committee that re-reads every memo on every decision. An SSM without selection is a rolling briefing that updates the same way on every memo, regardless of content. A selective SSM is a briefing where each incoming memo decides, by its own content, whether to update the state strongly, barely, or not at all.

Same linear-time recurrence. Radically different information flow.

Where the Complexity Gap Comes From

Linear versus quadratic isn’t a tuning difference. It comes from the shape of what each architecture stores.

Why does Mamba scale linearly while transformers scale quadratically?

Self-attention computes a similarity between every pair of positions in the sequence. With sequence length n and hidden size d, you pay O(n² d) in time and O(n²) in activation memory during training. Double n, and both quadruple.

An SSM doesn’t store pairwise interactions. It stores a fixed-size hidden state that every token updates in turn. Compute scales linearly in sequence length during both training and inference, and the original paper reports inference throughput roughly five times that of similarly sized transformers (Gu & Dao, Mamba paper). That headline figure comes from the paper’s own benchmark configuration; real-world end-to-end throughput depends on the hybrid ratio, kernel implementation, and whether mixture-of-experts layers are in play. The number is directional, not a blanket guarantee.
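A back-of-the-envelope count shows the two growth rates side by side. The cost functions below are rough operation counts with constants dropped, and the sizes are illustrative, not taken from any specific model.

```python
def attention_cost(n, d):
    """Pairwise scores: n * n dot products, each of length d."""
    return n * n * d

def ssm_cost(n, d, state_size):
    """One state update per token: d channels times state_size entries each."""
    return n * d * state_size

d, state_size = 1024, 16   # illustrative sizes (assumptions)
rows = []
for n in (4_096, 8_192, 16_384):
    rows.append((n, attention_cost(n, d), ssm_cost(n, d, state_size)))

# Growth factor when the sequence length doubles.
attention_growth = attention_cost(8_192, d) / attention_cost(4_096, d)
ssm_growth = ssm_cost(8_192, d, state_size) / ssm_cost(4_096, d, state_size)
```

Doubling n gives `attention_growth == 4.0` and `ssm_growth == 2.0`. The absolute numbers depend on constants this sketch ignores; the ratio is what carries over.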

The cost of this efficiency is specific. Attention keeps every past token explicitly accessible to every future decision. An SSM keeps a compressed summary. For a fixed hidden size, longer sequences mean more compression pressure on the same state.
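The same trade-off shows up in inference memory. As a rough sketch, assume standard multi-head attention that caches one key and one value vector per token per layer, and an SSM that keeps one state of `state_size` entries per channel per layer. All sizes are illustrative assumptions.

```python
def kv_cache_entries(n_tokens, n_layers, d_model):
    """Keys and values for every past token at every attention layer."""
    return 2 * n_tokens * n_layers * d_model

def ssm_state_entries(n_layers, d_model, state_size):
    """One fixed-size state per layer, independent of sequence length."""
    return n_layers * d_model * state_size

def state_entries_per_token(n_tokens, n_layers, d_model, state_size):
    """How much state is available per token of history: shrinks as context grows."""
    return ssm_state_entries(n_layers, d_model, state_size) / n_tokens

n_layers, d_model, state_size = 32, 1024, 16   # illustrative sizes (assumptions)
short_ctx = state_entries_per_token(1_000, n_layers, d_model, state_size)
long_ctx = state_entries_per_token(100_000, n_layers, d_model, state_size)
```

The cache grows with every token; the SSM state does not. The flip side is `long_ctx` being a hundred times smaller than `short_ctx`: the same state has to summarize a hundred times more history.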

Mamba-3, accepted at ICLR 2026, pushes this trade-off further (Mamba-3 paper). A more expressive discretization and complex-valued state updates reach comparable perplexity to Mamba-2 at roughly half the state size. Denser information per unit of memory — same recurrence shape.

Property	Self-attention	Unselective SSM (S4)	Selective SSM (Mamba)
Sequence complexity	O(n²)	O(n log n) via FFT	O(n) via parallel scan
Per-token dynamics	Uniform, all pairs	Fixed recurrence	Input-dependent recurrence
Explicit token access	Yes (attention weights)	No (compressed state)	No (compressed, selected state)
Training parallelism	Full	Convolutional	Associative scan

Figure: A selective SSM updates a fixed-size hidden state per token in linear time; self-attention computes every pair and scales quadratically.

What the Hybrids Are Telling You

The interesting question is no longer whether SSMs can match transformers. It is which layers of a transformer you can replace without losing the capabilities that attention uniquely provides.

Nearly every production-scale SSM-based long-context model in 2026 is a hybrid architecture. Jamba 1.5, from AI21, runs one attention layer for every seven Mamba layers and inserts mixture-of-experts blocks every two layers. Nemotron-H, from NVIDIA, replaces roughly 92 percent of self-attention layers with Mamba-2 and reports up to 2.9× long-context speedup on the 47B variant versus comparably sized dense transformers (NVIDIA ADLR, Nemotron-H). Falcon H1 uses a different topology entirely: attention heads and SSM heads running in parallel within each block, their outputs concatenated before projection. The pure-recurrent branch, RWKV and its attention-free linear-attention cousins, keeps the architecture alive at the smaller end of the scale.

Across the zoo, one number keeps recurring. One attention layer per five to seven SSM layers is the de-facto sweet spot (AI21 Blog, Rise of Hybrid LLMs).
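Declaring such a ratio amounts to writing down a layer schedule. The sketch below generates one for a given attention-to-SSM ratio and mixture-of-experts interval. The function name is hypothetical, and where the attention layer sits inside each group is an assumption made for illustration, not the published layout of any model named above.

```python
def hybrid_schedule(n_layers, ssm_per_attention=7, moe_every=2):
    """Return a (mixer, mlp) label pair for each layer.
    The position of the attention layer inside each group is an assumption."""
    period = ssm_per_attention + 1
    layers = []
    for i in range(n_layers):
        mixer = "attention" if i % period == period // 2 else "mamba"
        mlp = "moe" if i % moe_every == 1 else "dense"
        layers.append((mixer, mlp))
    return layers

def attention_fraction(layers):
    """Share of layers whose sequence mixer is attention."""
    return sum(1 for mixer, _ in layers if mixer == "attention") / len(layers)

layers = hybrid_schedule(32, ssm_per_attention=7, moe_every=2)
frac = attention_fraction(layers)
```

For 32 layers at a 1:7 ratio, `frac` is 0.125: four attention layers carry all the explicit token access, and the other 28 run in linear time.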

If you know the ratio and the state size, you can predict the failure modes:

If the model is mostly SSM, expect strong long-context fluency and weak 5-shot in-context learning.
If the model sits inside that hybrid sweet spot, expect the in-context learning gap to mostly close without sacrificing long-context throughput.
If the sequence is short, the complexity gap shrinks — quadratic attention is cheap when n is small.

Rule of thumb: For short sequences with heavy few-shot prompting, dense attention still pays off. For very long contexts, hybridize aggressively or pay the quadratic compute bill.

When it breaks: Pure SSMs compress history into a fixed-size hidden state instead of keeping explicit token access. On 5-shot MMLU and copy-recall benchmarks, they lag transformers — and the NeurIPS 2025 “Achilles’ Heel of Mamba” paper shows this is a structural property of the compression, not a tuning issue (arXiv, SSM long-context characterization). Hybrids substantially close the gap. They do not fully eliminate it.

The Data Says

Among the long-context models running in 2026, the frontier architectures keep most of their compute in a selective SSM and reserve a thin attention budget for the tasks where explicit token access is non-negotiable. That isn’t the end of the transformer — it is the end of the transformer-only era. Selection turned state space models from a theoretical curiosity into the default long-context substrate.

Sources

Gu & Dao (Mamba paper): Mamba: Linear-Time Sequence Modeling with Selective State Spaces - Selection mechanism (Δ, B, C as input-dependent), linear complexity, hardware-aware parallel scan, 5× throughput figure
Mamba-3 paper: Mamba-3: Improved Sequence Modeling using State Space Principles - Expressive discretization, complex-valued state, comparable perplexity at half the state size
S4 paper: Efficiently Modeling Long Sequences with Structured State Spaces - Normal-plus-low-rank (NPLR) parameterization, predecessor to Mamba
NVIDIA ADLR (Nemotron-H): Nemotron-H: A Family of Accurate, Efficient Hybrid Mamba-Transformer Models - ~92% Mamba-2 layer replacement, up to 2.9× long-context speedup on the 47B variant
AI21 Blog (Rise of Hybrid LLMs): Attention was never enough: Tracing the rise of hybrid LLMs - Jamba 1.5 architecture, 1:5 to 1:7 attention-to-SSM ratio as production sweet spot
arXiv (SSM long-context characterization): Characterizing State Space Model and Hybrid Language Model Performance with Long Context - “Achilles’ Heel of Mamba” — structural lag on 5-shot MMLU and copy/recall

Aha Moments

MAX

Mona frames this as architecture, but from the specification side, what just happened is that “attention everywhere” quietly stopped being the default. Now you have to declare an attention-to-SSM ratio the same way you’d declare a batch size or a context budget. That’s a design decision, and it belongs in the architecture review, not the appendix. Teams that still write specs assuming a dense transformer are specifying for a model class that’s no longer what people are actually deploying at long context. If your spec talks about long-context behavior without naming the hybrid ratio, the state size, and which layers carry attention, it is incomplete. The fix is one line in the context file: name the architecture, name the ratio, done. Everything downstream — eval plan, cost model, kernel choice — falls out of that single declaration.

DAN

Max treats the hybrid ratio as a spec knob. From the market side, it’s a stampede. Every frontier lab that hit the long-context wall has already retooled, and the ones that haven’t are losing inference margins every week they wait. The pure-transformer era just ended. It didn’t end with a paper. It ended with the production choices of the labs that run the biggest context windows in the real world. The binary is simple. You’re either hybridizing your inference stack now, or you’re paying quadratic compute for capabilities your competitors will undercut before the end of the year. This isn’t about who has the most elegant recurrence mathematics. It’s about who restructures their inference budget fastest. The window on pure-transformer long-context products is closed. What replaces it is still being sorted.

ALAN

Max names the spec. Dan names the market pressure. What neither of them names is what a fixed-size hidden state actually does to a reader’s ability to audit a model’s decision. A transformer, for all its quadratic cost, leaves every token of the input explicitly accessible to the model — you can, in principle, inspect what the model attended to when it produced an answer. A selective SSM compresses history into a state whose contents are a learned summary nobody wrote down. When that state decides, per token, what to forget, it is making policy choices encoded as weights. Regulators will eventually demand explanations for long-context decisions — summaries of contracts, medical notes, legal filings. What happens when the most consequential compression is one nobody voted for, nobody audited, and nobody can fully reverse-engineer from the output?

AI-assisted content, human-reviewed. Images AI-generated.