DAN Analysis

From Stripe's 73% Cost Cut to SGLang's RadixAttention: Continuous Batching Deployments and Trends in 2026

GPU inference pipeline with batched requests flowing through parallel optimized processing lanes

TL;DR

  • The shift: Continuous batching became the production default — the new race is disaggregated serving and intelligent KV cache management on top of it.
  • Why it matters: The gap between optimized and unoptimized inference stacks is now measured in multiples, not percentages.
  • What’s next: NVIDIA Dynamo 1.0 and prefill-decode separation are collapsing the serving stack into a single orchestration layer.

Three years ago, continuous batching was an academic paper. Today it processes 50 million daily API calls at Stripe on one-third the GPU fleet they used before. That is not an optimization. That is a different cost structure, and the teams still running static batching are funding the gap.

The Serving Layer Just Picked Sides

Thesis: The continuous batching engine race consolidated in early 2026 — the remaining competition is about what sits on top of it.

Every major inference engine now ships continuous batching by default. vLLM hit v0.18.0 in March 2026. SGLang reached v0.5.9, now powering over 400,000 GPUs across xAI, AMD, NVIDIA, and Google Cloud (SGLang GitHub). TensorRT-LLM shipped v1.2.0 as stable. Three engines, three release cycles, one shared assumption: continuous batching underneath everything.

The Orca paper demonstrated 36.9x throughput improvement over FasterTransformer on GPT-3 175B (USENIX (OSDI ‘22)). That result ended the architectural debate.
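The mechanism behind that result is iteration-level scheduling: instead of holding a batch until its slowest request finishes, the scheduler admits new requests at every decoding step. A toy simulation makes the gap visible (request "lengths" here stand in for output token counts; real engines also juggle KV cache memory, which this sketch ignores):

```python
# Toy comparison of static vs continuous batching. Illustrative only;
# real engines schedule at the kernel level with KV cache management.

def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """New requests join as soon as a slot frees up (iteration-level)."""
    pending = list(lengths)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

# Mixed output lengths: short requests get stuck behind long ones
# under static batching, but refill freed slots under continuous.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, 4))      # 200: batches padded to longest
print(continuous_batching_steps(lengths, 4))  # 105: slots refill every step
```

With this workload the static scheduler spends nearly twice the GPU-steps, which is the head-of-line blocking that Orca's iteration-level scheduling removed.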

The differentiation now happens one layer up: memory management, prefix caching, and disaggregated serving. Each engine made a different bet on which layer matters most — and those bets are paying out.

Three Engines, One GPU, Different Bets

On a single H100 80GB running Llama 3.3 70B at FP8 quantization, the benchmarks split three ways (Spheron Blog — single-GPU setup, results vary by model and hardware):

TensorRT-LLM leads raw throughput: 2,780 tokens per second at 100 concurrent requests. SGLang: 2,460 tok/s. vLLM: 2,400 tok/s.

Throughput is one dimension.

SGLang posts the lowest time-to-first-token — 112ms at p50. vLLM: 120ms. In latency-sensitive applications serving requests with diverse temperature and sampling configurations, those milliseconds compound across millions of calls.

Cold start tells a different story entirely. vLLM spins up in roughly 62 seconds. SGLang: 58 seconds. TensorRT-LLM: approximately 28 minutes for engine compilation. If your deployment pattern involves frequent scaling events, that compilation overhead is a strategic constraint, not an inconvenience.
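Numbers like these are worth reproducing on your own workload rather than taken on faith. A minimal sketch of the metric computation from per-request traces (the trace schema here is invented for illustration; a real harness, such as vLLM's bundled serving benchmark, collects these fields from a streaming client):

```python
# Compute p50 TTFT and aggregate throughput from per-request traces.
# Each trace records request start, first-token arrival, completion,
# and tokens generated. (Illustrative schema, not a real engine's API.)
from statistics import median

def summarize(traces):
    ttfts = [t["first_token"] - t["start"] for t in traces]
    wall = max(t["end"] for t in traces) - min(t["start"] for t in traces)
    total_tokens = sum(t["tokens"] for t in traces)
    return {
        "p50_ttft_ms": median(ttfts) * 1000,
        "throughput_tok_s": total_tokens / wall,
    }

traces = [
    {"start": 0.00, "first_token": 0.12, "end": 2.0, "tokens": 512},
    {"start": 0.01, "first_token": 0.11, "end": 2.5, "tokens": 640},
    {"start": 0.02, "first_token": 0.15, "end": 3.0, "tokens": 768},
]
stats = summarize(traces)
print(round(stats["p50_ttft_ms"]))       # p50 time-to-first-token in ms
print(round(stats["throughput_tok_s"]))  # total tokens / wall-clock seconds
```

The key design choice is measuring throughput over wall-clock time across all concurrent requests, not per request; that is the number that actually maps to GPU cost.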

The real separation is in memory. vLLM’s PagedAttention eliminates 60-80% of KV cache waste through OS-style virtual memory management (vLLM Docs). SGLang’s RadixAttention takes a different path — automatic KV cache reuse across generation calls — which compounds for multi-turn and agentic workloads where prefix overlap is high (LMSYS Blog).

PagedAttention optimizes per-request efficiency. RadixAttention optimizes across requests. Both assume continuous batching as the substrate.
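The RadixAttention idea can be pictured with a toy prefix tree over token sequences. This is a deliberate simplification: the real implementation caches KV tensors at each radix-tree node with LRU eviction, while this sketch only tracks which prefixes have been computed.

```python
# Toy prefix cache in the spirit of RadixAttention: store seen token
# prefixes in a trie so a new request only needs prefill work for the
# tokens past its longest cached prefix.
class PrefixCacheNode:
    def __init__(self):
        self.children = {}

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def match(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())

cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5]        # shared system-prompt tokens
cache.insert(system_prompt + [10, 11])  # first request populates the cache
turn2 = system_prompt + [20, 21, 22]
hit = cache.match(turn2)
print(hit, len(turn2) - hit)  # 5 tokens reused, only 3 need fresh prefill
```

This is why the win compounds for agentic workloads: every turn shares the system prompt and conversation history, so the cached fraction grows as the session lengthens.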

Security & compatibility notes:

  • SGLang RCE (CVE-2026-3060): Critical remote code execution via pickle deserialization in the disaggregation module. No fix available as of March 2026. Do not expose SGLang disaggregation endpoints to untrusted networks.
  • SGLang RCE (CVE-2026-3059): Critical remote code execution in the multimodal generation module via pickle.loads(). Restrict multimodal input pipelines to trusted sources.
  • vLLM V0 Engine: Deprecated; removal scheduled end of June 2026. Migrate to V1 (default since v0.8.0).
  • TensorRT-LLM Backend Change: PyTorch is now the default backend with C++ sampler enabled. Review config renames before upgrading to v1.2.0+.
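The pickle findings generalize beyond SGLang: `pickle.loads` on attacker-controlled bytes is an arbitrary-code-execution primitive via `__reduce__`. A minimal demonstration of the risk and the usual mitigation, schema-checked JSON (the `Exploit` class and message shape below are illustrative, not SGLang's actual wire format):

```python
# Why pickle on untrusted input is an RCE primitive, and a safer pattern.
import json
import pickle

class Exploit:
    def __reduce__(self):
        # On unpickling, pickle calls the returned callable with these
        # args. An attacker substitutes os.system or worse for eval.
        return (eval, ("'attacker code ran'",))

payload = pickle.dumps(Exploit())
result = pickle.loads(payload)  # executes eval() during deserialization
print(result)                   # prints: attacker code ran

# Mitigation: plain-data wire formats with explicit shape validation.
def decode_message(raw: bytes) -> dict:
    msg = json.loads(raw)
    if not isinstance(msg, dict) or set(msg) != {"op", "kv_block_ids"}:
        raise ValueError("unexpected message shape")
    return msg

print(decode_message(b'{"op": "transfer", "kv_block_ids": [0, 1]}'))
```

Until fixes land, network isolation of disaggregation and multimodal endpoints is the only real control: JSON validation helps for new code, but cannot retrofit safety onto an endpoint that still calls `pickle.loads`.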

The Movers

Stripe’s numbers set the bar: 73% inference cost reduction after switching from HuggingFace Transformers to vLLM, handling 50 million daily API calls on one-third the previous GPU fleet (Introl Blog, citing an internal Stripe case study — no primary Stripe publication has been identified).

But Stripe is one data point. The pattern is industry-wide.

NVIDIA shipped Dynamo 1.0 at GTC in March 2026 — a disaggregated serving orchestration layer claiming 7x throughput on Blackwell hardware (NVIDIA Newsroom). The adopter list: AWS, Azure, GCP, OCI, CoreWeave, Together AI, Cursor, Perplexity, PayPal, Pinterest.

That is not early adoption. That is consensus forming in real time.

Meta, LinkedIn, Mistral, and HuggingFace run vLLM with disaggregated prefill-decode separation in production (vLLM Docs). The DistServe paper demonstrated up to 7x higher request rates versus traditional serving approaches (DistServe (Hao AI Lab)).

Continuous batching is the floor. Disaggregated serving is the ceiling. And the ceiling is dropping fast.
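The prefill-decode split can be pictured as a router in front of two dedicated GPU pools. A toy sketch, with the pool abstraction and phase labels invented for illustration (real systems like DistServe and Dynamo also migrate KV cache between pools, which is elided here):

```python
# Toy disaggregated scheduler: compute-bound prefill work goes to one
# GPU pool, memory-bound token-by-token decode to another, so a long
# prompt never blocks a latency-sensitive decode step.
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    queue: list = field(default_factory=list)

    def submit(self, request_id):
        self.queue.append(request_id)

class DisaggregatedRouter:
    def __init__(self):
        self.prefill = Pool("prefill")  # long prompts, high FLOPs
        self.decode = Pool("decode")    # short steps, KV-cache bound

    def route(self, request_id, phase):
        pool = self.prefill if phase == "prefill" else self.decode
        pool.submit(request_id)
        return pool.name

router = DisaggregatedRouter()
print(router.route("req-1", "prefill"))  # req-1 prompt → prefill pool
print(router.route("req-1", "decode"))   # after KV handoff → decode pool
print(router.route("req-2", "decode"))   # decodes never queue behind prefill
```

The payoff is that each pool can be sized and batched for its own bottleneck, which is where the claimed multi-x request-rate gains come from.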

Running Last Quarter’s Playbook

Teams still running monolithic inference — prefill and decode sharing the same GPU pool — are paying a tax on every request. Every long-context prefill that blocks a short decode wastes GPU cycles.

The Text Generation Inference stack you deployed 18 months ago still works. But “works” and “competitive” stopped being synonyms the moment disaggregated serving hit production.

The cost gap is widening. As engines optimize cache management and prefill-decode separation on top of continuous batching, teams on monolithic static stacks fall further behind with each quarterly release.

Anyone deploying SGLang in disaggregated mode without addressing its unpatched RCE vulnerabilities is running a risk no throughput number justifies.

You are either testing disaggregated serving now or optimizing a cost structure about to be undercut.

What Happens Next

Base case (most likely): Disaggregated prefill-decode becomes the default deployment pattern by Q4 2026. NVIDIA Dynamo or a similar orchestration layer handles the split automatically. Late movers absorb significantly higher per-request costs than early adopters. Signal to watch: vLLM and SGLang merge disaggregated serving into stable mainline releases. Timeline: 6-9 months.

Bull case: RadixAttention-style prefix caching combines with disaggregated serving — prefix reuse on dedicated prefill nodes, streaming decode on optimized pools. Cost per token drops by an order of magnitude for high-overlap workloads. Signal: Major cloud providers ship managed disaggregated endpoints with prefix-aware routing. Timeline: 9-15 months.

Bear case: An SGLang RCE exploitation triggers a production incident at a major deployer. The fallout slows disaggregated adoption by two quarters while teams wait for hardened alternatives. The stack fragments between NVIDIA-locked and open-source camps. Signal: A public post-mortem citing SGLang vulnerability exploitation. Timeline: 3-6 months.

Frequently Asked Questions

Q: How did Stripe reduce LLM inference costs 73 percent by switching to vLLM continuous batching? A: Stripe replaced HuggingFace Transformers with vLLM’s continuous batching, which schedules requests at the iteration level instead of waiting for full batches. The switch cut GPU requirements to one-third while handling 50 million daily API calls.

Q: vLLM vs TensorRT-LLM vs SGLang continuous batching throughput benchmarks on H100 in 2026? A: On one H100 80GB with Llama 3.3 70B FP8 at 100 concurrent requests: TensorRT-LLM leads at 2,780 tok/s, SGLang hits 2,460 tok/s, vLLM reaches 2,400 tok/s. Cold start heavily favors vLLM and SGLang over TensorRT-LLM.

Q: Which cloud providers and AI startups use continuous batching in production inference APIs? A: AWS, Azure, GCP, OCI, CoreWeave, Together AI, Cursor, Perplexity, PayPal, and Pinterest deploy NVIDIA Dynamo. Meta, LinkedIn, Mistral, and HuggingFace run vLLM with disaggregated serving in production.

Q: How will disaggregated serving prefill-decode separation and RadixAttention reshape continuous batching in 2026? A: Disaggregated serving isolates prefill and decode on separate GPU pools, eliminating contention. RadixAttention adds cross-request KV cache reuse. Together they deliver compound efficiency gains, especially for multi-turn agentic workloads with repeated prefixes.

The Bottom Line

Continuous batching is no longer the advantage — it is the baseline. The edge belongs to teams deploying disaggregated serving with intelligent caching, eyes wide open on the security risks. The gap between “early mover” and “industry default” is measured in quarters now, not years.


