DAN Analysis

From Stripe's 73% Cost Cut to SGLang's RadixAttention: Continuous Batching Deployments and Trends in 2026

GPU inference pipeline with batched requests flowing through parallel optimized processing lanes

TL;DR

  • The shift: Continuous batching became the production default — the new race is disaggregated serving and intelligent KV cache management on top of it.
  • Why it matters: The gap between optimized and unoptimized inference stacks is now measured in multiples, not percentages.
  • What’s next: NVIDIA Dynamo 1.0 and prefill-decode separation are collapsing the serving stack into a single orchestration layer.

Three years ago, continuous batching was an academic paper. Today it processes 50 million daily API calls at Stripe on one-third the GPU fleet they used before. That is not an optimization. That is a different cost structure, and the teams still running static batching are funding the gap.

The Serving Layer Just Picked Sides

Thesis: The continuous batching engine race consolidated in early 2026 — the remaining competition is about what sits on top of it.

Every major inference engine now ships continuous batching by default. vLLM hit v0.18.0 in March 2026. SGLang reached v0.5.9, now powering over 400,000 GPUs across xAI, AMD, NVIDIA, and Google Cloud (SGLang GitHub). TensorRT-LLM shipped v1.2.0 as stable. Three engines, three release cycles, one shared assumption: continuous batching underneath everything.

The Orca paper demonstrated 36.9x throughput improvement over FasterTransformer on GPT-3 175B (USENIX (OSDI ‘22)). That result ended the architectural debate.
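The mechanism behind that result is iteration-level scheduling: instead of holding a batch until its slowest request finishes, the scheduler admits new requests at every decoding step. A toy simulation makes the gap visible (request "lengths" here stand in for output token counts; real engines also juggle KV cache memory, which this sketch ignores):

```python
# Toy comparison of static vs continuous batching. Illustrative only;
# real engines schedule at the kernel level with KV cache management.

def static_batching_steps(lengths, batch_size):
    """Each batch occupies the GPU until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """New requests join as soon as a slot frees up (iteration-level)."""
    pending = list(lengths)
    running = []
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1
        running = [r - 1 for r in running if r > 1]
    return steps

# Mixed output lengths: short requests get stuck behind long ones
# under static batching, but refill freed slots under continuous.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching_steps(lengths, 4))      # 200: batches padded to longest
print(continuous_batching_steps(lengths, 4))  # 105: slots refill every step
```

With this workload the static scheduler spends nearly twice the GPU-steps, which is the head-of-line blocking that Orca's iteration-level scheduling removed.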

The differentiation now happens one layer up: memory management, prefix caching, and disaggregated serving. Each engine made a different bet on which layer matters most — and those bets are paying out.

Three Engines, One GPU, Different Bets

On a single H100 80GB running Llama 3.3 70B at FP8 quantization, the benchmarks split three ways (Spheron Blog — single-GPU setup, results vary by model and hardware):

TensorRT-LLM leads raw throughput: 2,780 tokens per second at 100 concurrent requests. SGLang: 2,460 tok/s. vLLM: 2,400 tok/s.

Throughput is one dimension.

SGLang posts the lowest time-to-first-token — 112ms at p50. vLLM: 120ms. In latency-sensitive applications serving requests with diverse temperature and sampling configurations, those milliseconds compound across millions of calls.

Cold start tells a different story entirely. vLLM spins up in roughly 62 seconds. SGLang: 58 seconds. TensorRT-LLM: approximately 28 minutes for engine compilation. If your deployment pattern involves frequent scaling events, that compilation overhead is a strategic constraint, not an inconvenience.
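Numbers like these are worth reproducing on your own workload rather than taken on faith. A minimal sketch of the metric computation from per-request traces (the trace schema here is invented for illustration; a real harness, such as vLLM's bundled serving benchmark, collects these fields from a streaming client):

```python
# Compute p50 TTFT and aggregate throughput from per-request traces.
# Each trace records request start, first-token arrival, completion,
# and tokens generated. (Illustrative schema, not a real engine's API.)
from statistics import median

def summarize(traces):
    ttfts = [t["first_token"] - t["start"] for t in traces]
    wall = max(t["end"] for t in traces) - min(t["start"] for t in traces)
    total_tokens = sum(t["tokens"] for t in traces)
    return {
        "p50_ttft_ms": median(ttfts) * 1000,
        "throughput_tok_s": total_tokens / wall,
    }

traces = [
    {"start": 0.00, "first_token": 0.12, "end": 2.0, "tokens": 512},
    {"start": 0.01, "first_token": 0.11, "end": 2.5, "tokens": 640},
    {"start": 0.02, "first_token": 0.15, "end": 3.0, "tokens": 768},
]
stats = summarize(traces)
print(round(stats["p50_ttft_ms"]))       # p50 time-to-first-token in ms
print(round(stats["throughput_tok_s"]))  # total tokens / wall-clock seconds
```

The key design choice is measuring throughput over wall-clock time across all concurrent requests, not per request; that is the number that actually maps to GPU cost.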

The real separation is in memory. vLLM’s PagedAttention eliminates 60-80% of KV cache waste through OS-style virtual memory management (vLLM Docs). SGLang’s RadixAttention takes a different path — automatic KV cache reuse across generation calls — which compounds for multi-turn and agentic workloads where prefix overlap is high (LMSYS Blog).

PagedAttention optimizes per-request efficiency. RadixAttention optimizes across requests. Both assume continuous batching as the substrate.
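The RadixAttention idea can be pictured with a toy prefix tree over token sequences. This is a deliberate simplification: the real implementation caches KV tensors at each radix-tree node with LRU eviction, while this sketch only tracks which prefixes have been computed.

```python
# Toy prefix cache in the spirit of RadixAttention: store seen token
# prefixes in a trie so a new request only needs prefill work for the
# tokens past its longest cached prefix.
class PrefixCacheNode:
    def __init__(self):
        self.children = {}

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def match(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, PrefixCacheNode())

cache = PrefixCache()
system_prompt = [1, 2, 3, 4, 5]        # shared system-prompt tokens
cache.insert(system_prompt + [10, 11])  # first request populates the cache
turn2 = system_prompt + [20, 21, 22]
hit = cache.match(turn2)
print(hit, len(turn2) - hit)  # 5 tokens reused, only 3 need fresh prefill
```

This is why the win compounds for agentic workloads: every turn shares the system prompt and conversation history, so the cached fraction grows as the session lengthens.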

Security & compatibility notes:

  • SGLang RCE (CVE-2026-3060): Critical remote code execution via pickle deserialization in the disaggregation module. No fix available as of March 2026. Do not expose SGLang disaggregation endpoints to untrusted networks.
  • SGLang RCE (CVE-2026-3059): Critical remote code execution in the multimodal generation module via pickle.loads(). Restrict multimodal input pipelines to trusted sources.
  • vLLM V0 Engine: Deprecated; removal scheduled end of June 2026. Migrate to V1 (default since v0.8.0).
  • TensorRT-LLM Backend Change: PyTorch is now the default backend with C++ sampler enabled. Review config renames before upgrading to v1.2.0+.
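The pickle findings generalize beyond SGLang: `pickle.loads` on attacker-controlled bytes is an arbitrary-code-execution primitive via `__reduce__`. A minimal demonstration of the risk and the usual mitigation, schema-checked JSON (the `Exploit` class and message shape below are illustrative, not SGLang's actual wire format):

```python
# Why pickle on untrusted input is an RCE primitive, and a safer pattern.
import json
import pickle

class Exploit:
    def __reduce__(self):
        # On unpickling, pickle calls the returned callable with these
        # args. An attacker substitutes os.system or worse for eval.
        return (eval, ("'attacker code ran'",))

payload = pickle.dumps(Exploit())
result = pickle.loads(payload)  # executes eval() during deserialization
print(result)                   # prints: attacker code ran

# Mitigation: plain-data wire formats with explicit shape validation.
def decode_message(raw: bytes) -> dict:
    msg = json.loads(raw)
    if not isinstance(msg, dict) or set(msg) != {"op", "kv_block_ids"}:
        raise ValueError("unexpected message shape")
    return msg

print(decode_message(b'{"op": "transfer", "kv_block_ids": [0, 1]}'))
```

Until fixes land, network isolation of disaggregation and multimodal endpoints is the only real control: JSON validation helps for new code, but cannot retrofit safety onto an endpoint that still calls `pickle.loads`.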

The Movers

Stripe’s numbers set the bar: 73% inference cost reduction after switching from HuggingFace Transformers to vLLM, handling 50 million daily API calls on one-third the previous GPU fleet (Introl Blog, citing an internal Stripe case study — no primary Stripe publication has been identified).

But Stripe is one data point. The pattern is industry-wide.

NVIDIA shipped Dynamo 1.0 at GTC in March 2026 — a disaggregated serving orchestration layer claiming 7x throughput on Blackwell hardware (NVIDIA Newsroom). The adopter list: AWS, Azure, GCP, OCI, CoreWeave, Together AI, Cursor, Perplexity, PayPal, Pinterest.

That is not early adoption. That is consensus forming in real time.

Meta, LinkedIn, Mistral, and HuggingFace run vLLM with disaggregated prefill-decode separation in production (vLLM Docs). The DistServe paper demonstrated up to 7x higher request rates versus traditional serving approaches (DistServe (Hao AI Lab)).

Continuous batching is the floor. Disaggregated serving is the ceiling. And the ceiling is dropping fast.
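The prefill-decode split can be pictured as a router in front of two dedicated GPU pools. A toy sketch, with the pool abstraction and phase labels invented for illustration (real systems like DistServe and Dynamo also migrate KV cache between pools, which is elided here):

```python
# Toy disaggregated scheduler: compute-bound prefill work goes to one
# GPU pool, memory-bound token-by-token decode to another, so a long
# prompt never blocks a latency-sensitive decode step.
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    queue: list = field(default_factory=list)

    def submit(self, request_id):
        self.queue.append(request_id)

class DisaggregatedRouter:
    def __init__(self):
        self.prefill = Pool("prefill")  # long prompts, high FLOPs
        self.decode = Pool("decode")    # short steps, KV-cache bound

    def route(self, request_id, phase):
        pool = self.prefill if phase == "prefill" else self.decode
        pool.submit(request_id)
        return pool.name

router = DisaggregatedRouter()
print(router.route("req-1", "prefill"))  # req-1 prompt → prefill pool
print(router.route("req-1", "decode"))   # after KV handoff → decode pool
print(router.route("req-2", "decode"))   # decodes never queue behind prefill
```

The payoff is that each pool can be sized and batched for its own bottleneck, which is where the claimed multi-x request-rate gains come from.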

Running Last Quarter’s Playbook

Teams still running monolithic inference — prefill and decode sharing the same GPU pool — are paying a tax on every request. Every long-context prefill that blocks a short decode wastes GPU cycles.

The Text Generation Inference stack you deployed 18 months ago still works. But “works” and “competitive” stopped being synonyms the moment disaggregated serving hit production.

The cost gap is widening. As engines optimize cache management and prefill-decode separation on top of continuous batching, teams on monolithic static stacks fall further behind with each quarterly release.

Anyone deploying SGLang in disaggregated mode without addressing its unpatched RCE vulnerabilities is running a risk no throughput number justifies.

You are either testing disaggregated serving now or optimizing a cost structure about to be undercut.

What Happens Next

Base case (most likely): Disaggregated prefill-decode becomes the default deployment pattern by Q4 2026. NVIDIA Dynamo or a similar orchestration layer handles the split automatically. Late movers absorb significantly higher per-request costs than early adopters. Signal to watch: vLLM and SGLang merge disaggregated serving into stable mainline releases. Timeline: 6-9 months.

Bull case: RadixAttention-style prefix caching combines with disaggregated serving — prefix reuse on dedicated prefill nodes, streaming decode on optimized pools. Cost per token drops by an order of magnitude for high-overlap workloads. Signal: Major cloud providers ship managed disaggregated endpoints with prefix-aware routing. Timeline: 9-15 months.

Bear case: An SGLang RCE exploitation triggers a production incident at a major deployer. The fallout slows disaggregated adoption by two quarters while teams wait for hardened alternatives. The stack fragments between NVIDIA-locked and open-source camps. Signal: A public post-mortem citing SGLang vulnerability exploitation. Timeline: 3-6 months.

Frequently Asked Questions

Q: How did Stripe reduce LLM inference costs 73 percent by switching to vLLM continuous batching? A: Stripe replaced HuggingFace Transformers with vLLM’s continuous batching, which schedules requests at the iteration level instead of waiting for full batches. The switch cut GPU requirements to one-third while handling 50 million daily API calls.

Q: vLLM vs TensorRT-LLM vs SGLang continuous batching throughput benchmarks on H100 in 2026? A: On one H100 80GB with Llama 3.3 70B FP8 at 100 concurrent requests: TensorRT-LLM leads at 2,780 tok/s, SGLang hits 2,460 tok/s, vLLM reaches 2,400 tok/s. Cold start heavily favors vLLM and SGLang over TensorRT-LLM.

Q: Which cloud providers and AI startups use continuous batching in production inference APIs? A: AWS, Azure, GCP, OCI, CoreWeave, Together AI, Cursor, Perplexity, PayPal, and Pinterest deploy NVIDIA Dynamo. Meta, LinkedIn, Mistral, and HuggingFace run vLLM with disaggregated serving in production.

Q: How will disaggregated serving prefill-decode separation and RadixAttention reshape continuous batching in 2026? A: Disaggregated serving isolates prefill and decode on separate GPU pools, eliminating contention. RadixAttention adds cross-request KV cache reuse. Together they deliver compound efficiency gains, especially for multi-turn agentic workloads with repeated prefixes.

The Bottom Line

Continuous batching is no longer the advantage — it is the baseline. The edge belongs to teams deploying disaggregated serving with intelligent caching, eyes wide open on the security risks. The gap between “early mover” and “industry default” is measured in quarters now, not years.


