Request Queues and GPU Access: Who Waits Longest When Continuous Batching Decides

The Hard Truth
When a scheduling algorithm decides which request enters the GPU next, it is making a resource allocation decision. If that decision consistently favors certain users over others, is the algorithm optimizing – or rationing?
Every time you send a prompt to a large language model, your request joins a queue. Software determines the processing order, allocates memory, decides when your tokens begin generating. Most people never think about this infrastructure. The assumption – quiet, pervasive, almost never examined – is that the queue treats everyone equally.
The Queue Nobody Examines
Continuous batching replaced static batching because static batching was wasteful. Under the old model, every request in a batch had to wait for the longest one to finish – completed requests held their batch slots idle while longer ones kept generating. Continuous batching solved this by allowing requests to enter and exit the batch at every iteration, slot by slot. The original Orca system achieved a 36.9x throughput improvement over FasterTransformer on GPT-3 175B (Orca, OSDI ’22).
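The slot-by-slot mechanics can be sketched in a toy scheduler loop. This is an illustration only – the `Request` class, `MAX_SLOTS` constant, and FCFS fill policy are invented for the sketch and do not correspond to any real framework's API:

```python
from collections import deque
from dataclasses import dataclass

MAX_SLOTS = 4  # illustrative batch capacity

@dataclass
class Request:
    rid: int
    tokens_left: int  # tokens still to generate

def continuous_batching(requests, max_slots=MAX_SLOTS):
    """Toy continuous-batching loop: requests join and leave the batch
    at every iteration instead of waiting for the whole batch to drain."""
    waiting = deque(requests)
    active = []
    completed = []
    while waiting or active:
        # Fill any free slots from the head of the queue (FCFS).
        while waiting and len(active) < max_slots:
            active.append(waiting.popleft())
        # One decode iteration: every active request emits one token.
        for req in active:
            req.tokens_left -= 1
        # Finished requests exit immediately, freeing their slot
        # for the next waiting request.
        completed.extend(r for r in active if r.tokens_left == 0)
        active = [r for r in active if r.tokens_left > 0]
    return [r.rid for r in completed]
```

Under static batching, a short request admitted alongside a long one would occupy its slot until the long one finished; here it exits and the slot is refilled on the very next iteration.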
That number is hard to argue with. Most people don’t.
But throughput is a system-level metric. It tells you how many requests the hardware processes per second. It does not tell you whose requests got processed first, whose waited longest, or whether the distribution of wait times resembles anything fair. GPU utilization can reach near-optimal levels while specific classes of users experience disproportionate delays – and the aggregate number will never reveal it.
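A toy numerical example makes the point concrete. The wait times below are invented for illustration, not measured from any real deployment:

```python
import statistics

# Hypothetical per-request queueing delays (seconds) for two user classes.
premium = [0.05, 0.06, 0.04, 0.05, 0.07]
free    = [0.40, 1.20, 0.80, 2.50, 0.60]

# The aggregate looks acceptable...
print(f"aggregate mean wait: {statistics.mean(premium + free):.3f}s")

# ...while the per-class breakdown shows a ~20x gap.
print(f"premium mean wait:   {statistics.mean(premium):.3f}s")
print(f"free-tier mean wait: {statistics.mean(free):.3f}s")
```

The aggregate mean sits near the middle; only the per-class breakdown reveals the disparity – which is exactly why reporting throughput alone can hide it.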
The Honest Case for Speed
The efficiency argument deserves its strongest form. Before continuous batching, serving large language models was punishingly expensive. Memory sat allocated but unused. Hardware ran at a fraction of capacity. The engineering problem was genuine: how do you serve millions of inference requests without reserving a GPU cluster for every use case?
Continuous batching – and the frameworks that implement it, including vLLM, Text Generation Inference, TensorRT-LLM, and SGLang – solved a real bottleneck. With memory optimizations, vLLM demonstrated up to 23x throughput gains (Anyscale Blog). These are not marginal improvements. They made it economically viable for smaller organizations to serve models at scale. Techniques like quantization and careful temperature and sampling configuration further stretch limited hardware.
The throughput gains were real, and in aggregate, they benefited everyone. But the word “aggregate” is doing considerable work in that sentence.
The Assumption Hiding Inside the Scheduler
Here is what the efficiency narrative takes for granted: that scheduling requests by arrival time is the same as treating users fairly. Requests are not interchangeable. They differ in length, in computational cost, in the priority tier their sender occupies.
vLLM’s scheduler defaults to first-come, first-served – but also supports priority-based preemption, where requests are ranked by a priority tuple of assigned priority and arrival time (vLLM Docs). Ascendra, a priority-aware scheduling system, demonstrated that high-priority requests wait roughly a quarter as long as low-priority ones (Ikram et al., 2025). The system functions exactly as designed. High-priority users get faster responses.
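The priority-then-arrival ordering described in the vLLM docs can be sketched with a standard heap. The class and method names here are illustrative, not vLLM's actual API; only the ordering rule – lower priority value wins, ties broken by earlier arrival – is taken from the documentation:

```python
import heapq
import itertools

class PrioritySchedulerSketch:
    """Sketch of (priority, arrival_time) ordering: requests are popped
    by ascending priority value, then by ascending arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # monotonic arrival counter

    def submit(self, request_id, priority=0):
        # Lower priority value = served sooner; arrival order breaks ties.
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def next_request(self):
        _priority, _arrival, request_id = heapq.heappop(self._heap)
        return request_id

sched = PrioritySchedulerSketch()
sched.submit("free-user-req", priority=10)
sched.submit("enterprise-req", priority=0)
print(sched.next_request())  # the later-arriving enterprise request jumps the queue
```

Note what the tiebreaker implies: arrival time only matters among requests of equal priority. A late-arriving high-priority request always overtakes an early-arriving low-priority one.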
The question is not whether the engineering works. The question is who assigns the priority – and on what basis.
Research from the FairBatching framework found that continuous batching creates significant disparities in GPU access across request classes (Lyu et al., 2024). The authors proposed fairness-aware batch formation, using Jain’s Fairness Index to measure how evenly compute distributes. That such a framework needed inventing tells you something about the default – it is not fair.
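Jain’s Fairness Index is a standard metric, not specific to FairBatching: for allocations x₁…xₙ it is (Σxᵢ)² / (n·Σxᵢ²), equal to 1.0 when compute is split evenly and approaching 1/n as one party takes everything. A minimal implementation:

```python
def jains_index(allocations):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2).
    Returns 1.0 for a perfectly even split; tends toward 1/n
    as a single party captures the whole resource."""
    n = len(allocations)
    total = sum(allocations)
    return total * total / (n * sum(x * x for x in allocations))

print(jains_index([1, 1, 1, 1]))    # 1.0 — compute evenly distributed
print(jains_index([100, 1, 1, 1]))  # ~0.265 — one class dominates
```

Measured over per-class GPU time, a value drifting well below 1.0 is precisely the disparity the FairBatching authors report for default continuous batching.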
When Infrastructure Becomes Rationing
There is a historical pattern worth noticing. Every shared resource – spectrum, water, electricity, bandwidth – begins with the promise of neutral access and ends with an allocation regime reflecting existing power structures. The engineers who build the infrastructure rarely intend this. The economists who optimize it rarely prevent it.
GPU compute in the age of large language models is becoming a shared resource of extraordinary consequence. Who gets fast inference matters because inference is increasingly the bottleneck between a question and a decision – in research, medical triage, legal analysis, creative production. The latency of your request is not a technical detail. It is a measure of your access to the most powerful information-processing infrastructure humans have built.
Open-source projects, independent researchers, and smaller organizations do not negotiate enterprise SLAs. They run on shared infrastructure with default scheduling policies – policies designed for throughput, not equity. When a commercial provider implements priority tiers, the paying customer’s request enters the GPU before the open-source researcher’s. That is not a bug. It is the business model. But the consequence is that those building alternatives to concentrated AI power wait longest for compute.
Optimization Is Never Neutral
Thesis: Continuous batching is not a neutral optimization – it is an allocation mechanism, and allocation mechanisms without explicit fairness constraints reproduce the inequities of whoever controls the queue.
This is uncomfortable because the intent behind continuous batching is genuinely good. Engineers designed it to make inference accessible. But access and fairness are not the same thing. A system can be accessible to everyone and still consistently disadvantage those with less power to negotiate their position in the queue.
OWASP’s classification of Unbounded Consumption as a top-ten risk for LLM applications (OWASP GenAI) points to a related concern: without resource governance, inference systems become vulnerable to denial-of-service through resource exhaustion. The question of who consumes and who waits is not only ethical – it is architectural. Emerging work on fairness-aware scheduling, including NVIDIA’s time-based fairshare allocation for GPU clusters, suggests the field is beginning to recognize this. But these solutions remain young, optional, and rarely enabled by default. Meanwhile, the inference frameworks themselves carry unresolved security debts, adding yet another dimension to the governance gap.
Security & compatibility notes:
- vLLM Memory Corruption (CVE-2025-62164): Unsafe tensor deserialization in versions 0.10.2 through 0.11.0 enables memory corruption. Update to patched versions.
- ShadowMQ (CVE-2025-30165 / CVE-2025-23254): Affects vLLM pre-0.8.0 and TensorRT-LLM pre-0.18.2. Patched in vLLM v0.8.0+ and TensorRT-LLM 0.18.2+.
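A minimal version gate against the advisories above can be sketched as follows. The helper names are invented, the parser handles only plain X.Y.Z strings (no release-candidate suffixes), and the version boundaries are copied from the CVE notes in this article – verify them against the official advisories before relying on this:

```python
def parse_version(v):
    """Parse a plain 'X.Y.Z' version string into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def vllm_is_vulnerable(installed):
    """Check an installed vLLM version against the two advisories above."""
    v = parse_version(installed)
    # CVE-2025-62164: unsafe tensor deserialization, 0.10.2 through 0.11.0.
    if parse_version("0.10.2") <= v <= parse_version("0.11.0"):
        return True
    # ShadowMQ (CVE-2025-30165): affects vLLM before 0.8.0.
    if v < parse_version("0.8.0"):
        return True
    return False

print(vllm_is_vulnerable("0.10.3"))  # True  — inside the deserialization window
print(vllm_is_vulnerable("0.11.1"))  # False — past both patched boundaries
```

In practice you would read the installed version via `importlib.metadata.version("vllm")` rather than hard-coding it.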
The Questions We Owe the Queue
This is not an argument against continuous batching. It is an argument for treating scheduling policy as governance – visible, auditable, subject to the same scrutiny we apply to any system that allocates scarce resources among unequal parties.
What would it mean to publish scheduling policies the way we publish terms of service? What would it cost to make fairness metrics a standard part of inference benchmarking, alongside throughput and latency? And who bears responsibility for ensuring that the researchers building the next generation of open models are not systematically disadvantaged by the infrastructure serving the current ones?
Where This Argument Is Most Fragile
The strongest objection is economic: someone has to pay for the GPUs, and priority scheduling funds infrastructure that smaller users then benefit from. Without enterprise revenue, the shared infrastructure might not exist at all. The fairness critique assumes a commons that exists precisely because of the market dynamics it questions.
There is also a measurement problem. No public data exists on how commercial API providers implement request prioritization internally – their scheduling logic is proprietary. The claim that smaller users are systematically disadvantaged is an architectural inference, not a formally measured outcome. If the disparity proves marginal in practice, the ethical alarm may exceed the actual harm.
The Question That Remains
Continuous batching made large-scale inference viable. That achievement is not in dispute. What remains unresolved is whether the systems scheduling our access to AI compute should be treated as plumbing – invisible, unquestioned, optimized solely for aggregate throughput – or as governance infrastructure deserving the transparency we expect from any institution that decides who gets served and who waits.
The queue is not neutral. The question is whether we are willing to look at it.
Disclaimer
This article is for educational purposes only and does not constitute professional advice. Consult qualified professionals for decisions in your specific situation.
AI-assisted content, human-reviewed. Images AI-generated.