Nemotron-H

Also known as: NVIDIA Nemotron-H, Nemotron hybrid, Nemotron Mamba-Transformer

Nemotron-H
NVIDIA’s family of hybrid Mamba-Transformer language models that replaces most self-attention layers with Mamba-2 state space layers, cutting inference cost at the same accuracy target while keeping a small number of attention layers for precise long-context recall.

What It Is

Self-attention cost grows with the square of sequence length, so a model that handles a chat message cheaply chokes on a hundred-thousand-token document. Nemotron-H is NVIDIA’s answer. Rather than shrinking the model or the context window, NVIDIA changed the architecture: most self-attention layers are replaced with Mamba-2 layers — state space layers that run at constant compute and memory per generated token — and a handful of attention layers stay behind to preserve the precise recall behavior attention is known for. Think of it as a highway with a few toll gates: the toll gates are expensive, so you keep only the ones that actually earn their cost and let the rest of the traffic flow freely.
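The scaling difference can be made concrete with a toy cost model. This is an illustrative sketch, not NVIDIA's arithmetic: it only captures that attention's per-token cost grows with context length while a state space layer's stays flat.

```python
# Toy cost model (illustrative only, not NVIDIA's numbers).

def attention_cost_per_token(context_len: int) -> int:
    # Each new token attends to every cached token: O(n) per generated
    # token, hence O(n^2) over a full sequence.
    return context_len

def ssm_cost_per_token(context_len: int) -> int:
    # A Mamba-style layer updates a fixed-size state: O(1) per token,
    # regardless of how long the context already is.
    return 1

for n in (1_000, 100_000):
    ratio = attention_cost_per_token(n) / ssm_cost_per_token(n)
    print(f"context {n:>7}: attention/SSM per-token cost ratio = {ratio:.0f}x")
```

The gap between the two rows is the reason a hundred-thousand-token document changes the economics: the attention cost ratio grows with the context, the SSM cost does not.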

The original Nemotron-H release shipped two base models at different scales. According to NVIDIA ADLR, the family includes 8B and 56B base models, plus a compressed 47B variant derived from the 56B through NVIDIA’s MiniPuzzle pruning-and-distillation technique. The 8B internal layout tells the story clearly: according to NVIDIA ADLR, the 8B is built from 24 Mamba-2 layers, 24 MLP layers (multi-layer perceptron blocks that mix features within a token), and only 4 self-attention layers. Attention is the minority; the heavy sequence-mixing work happens in the Mamba-2 layers. Training ran at an unusually large scale for a hybrid — according to NVIDIA ADLR, the 56B was pre-trained on 20T tokens using FP8 per-tensor scaling (a low-precision numeric format that cuts memory use during training), and the 8B on 15T tokens. Data budget matters here because hybrid behaviors only emerge clearly when the model is trained on the same kind of scale as a pure Transformer of similar size.
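The layer budget is worth seeing laid out. The counts below (24 Mamba-2, 24 MLP, 4 attention) are the figures NVIDIA ADLR reports for the 8B; the interleaving order in this sketch is purely hypothetical, since the source describes the counts, not the exact layout.

```python
# Sketch of the Nemotron-H 8B layer budget reported by NVIDIA ADLR.
# Counts are from the source; the ordering below is hypothetical.
from collections import Counter

LAYER_COUNTS = {"mamba2": 24, "mlp": 24, "attention": 4}

def hypothetical_stack() -> list[str]:
    # Hypothetical layout: an alternating Mamba-2/MLP backbone with the
    # 4 attention layers spread evenly through it (positions are made up).
    stack = []
    for i in range(24):
        stack.append("mamba2")
        stack.append("mlp")
        if i in (5, 11, 17, 23):  # illustrative attention positions only
            stack.append("attention")
    return stack

stack = hypothetical_stack()
assert Counter(stack) == Counter(LAYER_COUNTS)
share = stack.count("attention") / len(stack)
print(f"{len(stack)} layers; attention share = {share:.1%}")
```

Whatever the real ordering, the ratio is the point: attention is roughly one layer in thirteen, so almost all sequence mixing happens at linear cost.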

The payoff is inference speed at parity. According to NVIDIA ADLR, Nemotron-H runs up to roughly three times faster than similarly-sized Qwen-2.5 and Llama-3.1 models at matched accuracy. That speed comes from the same idea behind every serious 2025–2026 hybrid: linear-time state propagation for most layers, exact attention for the few places recall really matters. NVIDIA later extended the recipe to reasoning workloads — according to NVIDIA ADLR, Nemotron Nano 2 is a reasoning-tuned hybrid Mamba-Transformer built on the same foundation, released in August 2025. For readers tracking state space models, Nemotron-H is one of the clearest signals that the SSM research line has reached production.

How It’s Used in Practice

Most people encounter Nemotron-H through an inference endpoint or a model card in a serving stack, not through training code. The difference shows up in cost and latency numbers rather than in the prompt interface. Teams deploying Nemotron-H tend to pick it for workloads where context runs long: multi-document question answering, long transcripts, large code contexts, extended chat sessions. According to NVIDIA ADLR, the roughly threefold throughput gain over Qwen-2.5 and Llama-3.1 at comparable accuracy translates into lower serving cost per request on those long-context workloads.


For builders, the practical caveat is cache behavior. Transformer serving stacks assume prefix-cache reuse across turns for identical system prompts; the Mamba-2 portion of a Nemotron-H model carries a running state that does not reuse the same way an attention KV cache does (the key-value cache Transformers use to avoid recomputing earlier tokens). Throughput numbers from a pure Transformer deployment rarely carry over unchanged.
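The cache difference can be sketched with two toy steppers. This is an assumed, heavily simplified serving model: a KV cache is a per-token list, so a shared prefix can be cached and re-extended per request; a recurrent state is one fixed-size value, so history is folded in rather than stored token by token.

```python
# Toy contrast (simplified, illustrative only) between an attention KV
# cache and a Mamba-style running state.

kv_cache = []                      # one entry per processed token

def transformer_step(token):
    # Attention stores keys/values for every token; a shared prefix's
    # entries can be reused and extended by later requests.
    kv_cache.append(("k", "v", token))

state = 0.0                        # fixed-size recurrent state (toy scalar)

def ssm_step(token):
    # Old tokens are folded into the state, not stored individually, so
    # the state cannot be sliced back to an arbitrary earlier token.
    global state
    state = state * 0.9 + token

for tok in (1, 2, 3):
    transformer_step(tok)
    ssm_step(tok)

print(len(kv_cache))  # grows with context: one sliceable entry per token
print(state)          # stays a single value: history summarized, not indexed
```

A hybrid's SSM state can still be checkpointed after a fixed shared prefix, but the fine-grained, per-token reuse that Transformer serving stacks optimize for does not carry over directly.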

Pro Tip: Before replacing a Transformer in a production pipeline, run Nemotron-H on prompts that depend on exact recall from the middle of a long document. The handful of attention layers is what preserves that behavior, but the attention-to-SSM ratio differs between the 8B and 56B, and your workload may lean on precisely the retrieval patterns that a hybrid handles unevenly.
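The recall check in the tip can be automated as a small smoke test. Everything here is a hypothetical harness: `build_prompt`, `check_recall`, and the `ask_model` callable are stand-ins for whatever client your serving stack exposes, not a real API.

```python
# Hypothetical mid-document recall smoke test. `ask_model` is a stand-in
# for your own inference client; it is not a real API.
import uuid

def build_prompt(filler_tokens: int = 50_000) -> tuple[str, str]:
    # Plant a unique fact ("needle") in the middle of a long filler
    # context, then ask for it back.
    needle = f"The audit code is {uuid.uuid4().hex[:8]}."
    filler = "Lorem ipsum dolor sit amet. " * (filler_tokens // 6)
    half = len(filler) // 2
    prompt = filler[:half] + needle + filler[half:] + "\nWhat is the audit code?"
    return prompt, needle

def check_recall(ask_model) -> bool:
    # Pass if the model's answer contains the planted code verbatim.
    prompt, needle = build_prompt()
    code = needle.split()[-1].rstrip(".")
    return code in ask_model(prompt)

# Sanity check with a stub "model" that just echoes its input, so the
# planted code is trivially present in the answer.
print(check_recall(lambda p: p))
```

Run the same harness against both the Transformer you are replacing and the hybrid, at the context lengths your production traffic actually hits, before reading anything into the throughput numbers.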

When to Use / When Not

Use:
- Serving long documents or full codebases to a foundation model end-to-end
- GPU memory is the binding constraint on how much context you can fit
- Running on NVIDIA-native serving stacks where hybrid support is already mature

Avoid:
- Short chat prompts where first-token latency is the dominant metric
- Workload depends on aggressive prefix-cache reuse across identical system prompts
- Task is short-form retrieval where a small Transformer already fits comfortably

Common Misconception

Myth: Nemotron-H is a pure Mamba model — NVIDIA dropped self-attention entirely in favor of state space layers. Reality: Nemotron-H is a hybrid. According to NVIDIA ADLR, the 8B model keeps 4 self-attention layers alongside its 24 Mamba-2 layers and 24 MLP layers, precisely so exact recall and copy behaviors survive. The goal is not to eliminate attention but to use it only where it earns its quadratic cost.

One Sentence to Remember

Nemotron-H is one of the clearest production examples of the state space model research line reaching scale: a hybrid that replaces most self-attention with Mamba-2 layers, keeps a few attention layers for precise recall, and shows that the cost gap between pure-Transformer and hybrid inference is now large enough to matter on real workloads.

FAQ

Q: What is Nemotron-H? A: Nemotron-H is NVIDIA’s family of hybrid Mamba-Transformer language models. It replaces most self-attention layers with Mamba-2 state space layers to cut inference cost while keeping a few attention layers for precise long-context recall.

Q: Is Nemotron-H a pure Mamba model? A: No. It is a hybrid. According to NVIDIA ADLR, the 8B base model contains 24 Mamba-2 layers, 24 MLP layers, and 4 self-attention layers, so attention is reduced but not removed.

Q: How is Nemotron-H different from a standard Transformer? A: A standard Transformer uses self-attention in every layer. Nemotron-H replaces most of those attention layers with Mamba-2 layers that run at constant compute per token and keeps only a handful of attention layers.

Expert Takes

The interesting claim in Nemotron-H is not that Mamba-2 layers run faster — that is arithmetic — but that you can replace most attention layers with state space layers at a large training scale and still match a pure Transformer on accuracy. The design keeps a small number of attention layers precisely where exact copy behavior is needed. None of that comes free: it takes careful architectural surgery plus a heavy training run to prove the surgery worked.

For anyone designing inference pipelines, Nemotron-H forces one assumption to be retested: prefix caching. Transformer serving stacks are tuned around KV cache reuse across turns, and the Mamba-2 portion of a hybrid does not carry the same shape of state. If your context-engineering workflow depends on long system prompts reused aggressively across requests, benchmark the hybrid on real traffic before swapping it in. The architecture matters, but so does the spec for serving it.

NVIDIA shipping a hybrid at this scale is a market signal, not a science project. When the company that sells the GPUs Transformers run on ships its own hybrid line, the read is clear: quadratic attention is losing ground to linear-time state. If your roadmap assumes a pure-Transformer serving stack through the next cycle, you are planning around economics the rest of the field has already walked away from.

Hybrid speedups at parity accuracy are an average, and averages hide distribution tails. The Mamba-2 portion of a model does not recall long-range exact tokens the way attention does, and the handful of remaining attention layers can only carry so much of that load. For some retrieval patterns the drop shows up quietly: recall from the middle of a long document, chained reasoning across many references. Ask what was benchmarked and what was not before accepting the throughput headline.