Request Concurrency
Also known as: concurrent requests, inference concurrency, request parallelism
- Request Concurrency
- The number of simultaneous inference requests an LLM server processes at a given moment. Unlike traditional web services, LLM concurrency has non-linear effects: long-prompt requests can monopolize GPU prefill capacity, creating batch interference that spikes tail latency for all queued requests.
Request concurrency is the number of simultaneous inference requests an LLM server handles at once — the variable that separates a single-user benchmark from a production-realistic load test.
What It Is
If you are evaluating whether an LLM deployment can handle real traffic, request concurrency is the number you need to get right first. A test at one concurrent request tells you the best-case latency for a single user. A test at fifty concurrent requests tells you what happens when real users arrive at the same time. The gap between those two measurements is often where production surprises come from.
In a traditional web API, resource use scales roughly linearly with concurrency: doubling the requests about doubles the load. LLM inference does not work this way, because generation has two distinct phases with different cost structures.
Prefill processes the entire input prompt in one pass and produces the first token. It is compute-intensive and scales with prompt length — a 1,000-token prompt takes longer to prefill than a 100-token one. Decode then generates one token at a time until the response is complete. Both phases compete for the same GPU compute, and prefill cannot easily yield mid-operation to let another request through.
This creates what engineers call batch interference. When a long-prompt request enters a server where other requests are queued, it occupies the GPU for the whole of its prefill phase, and every request behind it waits until that prefill finishes. The longer the incoming prompt, the longer that wait. As concurrency rises, the chance that a long-prompt request is sitting ahead of any given request grows, and tail latency rises in a way a single-request test cannot predict.
Think of a single checkout lane at a grocery store. One customer with a full cart arrives just before you. Your transaction takes thirty seconds; theirs takes five minutes. In an LLM server, one request with a 10,000-token prompt is that full cart — there is one GPU prefill lane, and no request can skip ahead.
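The single-lane picture above can be made concrete with a toy model. The sketch below is not how any real inference server schedules work: it assumes all requests arrive together, are served strictly first-in first-out, and that prefill time is proportional to prompt length at a made-up rate of 50 microseconds per token. The function name `queue_waits` and the parameter `prefill_us_per_token` are hypothetical.

```python
def queue_waits(prompt_tokens, prefill_us_per_token=50):
    """Return (wait_us, ttft_us) per request for one FIFO prefill lane.

    Assumptions (illustrative only): every request arrives at time zero in
    list order, and prefill cost is linear in prompt length.
    """
    results = []
    clock_us = 0
    for tokens in prompt_tokens:
        wait_us = clock_us                       # time spent behind earlier prefills
        clock_us += tokens * prefill_us_per_token
        results.append((wait_us, clock_us))      # TTFT = wait + own prefill
    return results


# Four short prompts versus the same queue with one long prompt at the front.
short_only = queue_waits([100, 100, 100, 100])
with_long = queue_waits([10_000, 100, 100, 100])

for label, rows in (("short only", short_only), ("long first", with_long)):
    print(label, [f"{ttft / 1000:.0f} ms" for _, ttft in rows])
```

Under these assumptions the short requests finish their prefill within 20 ms when queued only behind each other, but each one waits at least 500 ms once the 10,000-token prompt is ahead of them.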
According to Spheron Blog, the effect is pronounced. At 8 concurrent requests, P99 time-to-first-token (TTFT) runs around 90 ms. At 32 concurrent requests, it climbs to roughly 280 ms. At 64 concurrent requests, it reaches approximately 480 ms: more than a fivefold increase in tail latency for an eightfold rise in concurrency.
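The growth factors behind those figures are easy to check. The snippet below only restates the three approximate P99 TTFT values quoted above and divides each by the 8-request baseline; the variable names are illustrative.

```python
# Approximate P99 TTFT figures quoted above (concurrency -> milliseconds).
p99_ttft_ms = {8: 90, 32: 280, 64: 480}

base_concurrency = 8
base_ms = p99_ttft_ms[base_concurrency]

# For each level: (load multiple, latency multiple) relative to the baseline.
growth = {
    c: (c / base_concurrency, round(ms / base_ms, 2))
    for c, ms in p99_ttft_ms.items()
}

for c, (load_x, latency_x) in growth.items():
    print(f"concurrency {c}: {load_x:.0f}x load, {latency_x}x P99 TTFT")
```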
This non-linearity is exactly why concurrency must be calibrated before any other metric in an LLM load test carries meaning.
How It’s Used in Practice
Load testing tools expose request concurrency as their primary control. According to AIPerf GitHub, the aiperf benchmarking tool accepts a --concurrency flag specifying how many requests to send simultaneously, and reports latency and throughput metrics at each level independently. According to LLMPerf GitHub, LLMPerf controls concurrency through the number of Ray workers spawned per test: each worker handles one request at a time, so adding workers directly raises the concurrency level.
A standard workflow is a concurrency sweep: run the test at successively higher concurrency levels and record P50 and P99 latency at each step. The resulting curve shows where batch interference starts to dominate and where throughput plateaus. That curve — not a single point result — is what tells you whether a deployment is ready for production traffic.
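A sweep of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the method used by AIPerf or LLMPerf: `run_level`, `sweep`, and `percentile` are hypothetical helpers, and `stub_request` is a stand-in that just sleeps for a millisecond. In a real test it would be replaced by an actual call to the inference endpoint.

```python
import asyncio
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


async def run_level(send_request, concurrency, total_requests):
    """Keep `concurrency` requests in flight until `total_requests` complete."""
    latencies = []
    remaining = iter(range(total_requests))  # shared: each request is taken once

    async def worker():
        for _ in remaining:
            start = time.perf_counter()
            await send_request()
            latencies.append(time.perf_counter() - start)

    await asyncio.gather(*(worker() for _ in range(concurrency)))
    return {
        "concurrency": concurrency,
        "p50_s": percentile(latencies, 50),
        "p99_s": percentile(latencies, 99),
    }


async def sweep(send_request, levels, requests_per_level):
    """Run each concurrency level in turn and collect one row per level."""
    return [await run_level(send_request, c, requests_per_level) for c in levels]


async def stub_request():
    """Hypothetical stand-in for a real inference call."""
    await asyncio.sleep(0.001)


if __name__ == "__main__":
    for row in asyncio.run(sweep(stub_request, [1, 2, 4, 8], 32)):
        print(row)
```

Each level runs to completion before the next starts, so the rows are independent measurements. With the stub, latency stays flat across levels; against a real server, the rows trace the curve described above.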
Pro Tip: Record latency at each concurrency step, not only at your expected peak. The shape of the curve tells you more than any single point: a steep early slope suggests your workload is long-prompt-heavy and will suffer from batch interference quickly; a gradual slope suggests shorter prompts dominate and the server has more headroom. That shape should inform your SLA thresholds and gateway rate limits.
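One simple way to read the shape of the curve is to compute the latency added per extra concurrent request between adjacent sweep steps. The helper `step_slopes` below is a hypothetical illustration; the sample points reuse the approximate P99 TTFT figures attributed to Spheron Blog earlier in this entry.

```python
def step_slopes(curve):
    """Latency added per extra concurrent request between adjacent steps.

    `curve` is a list of (concurrency, p99_ms) pairs sorted by concurrency.
    Returns (from_concurrency, to_concurrency, ms_per_added_request) tuples.
    """
    return [
        (c0, c1, (ms1 - ms0) / (c1 - c0))
        for (c0, ms0), (c1, ms1) in zip(curve, curve[1:])
    ]


# Approximate figures quoted earlier: (concurrency, P99 TTFT in ms).
curve = [(8, 90), (32, 280), (64, 480)]

slopes = step_slopes(curve)
steepest = max(slopes, key=lambda step: step[2])

for c0, c1, slope in slopes:
    print(f"{c0} -> {c1}: {slope:.2f} ms per added concurrent request")
print("steepest step:", steepest[:2])
```

For these three points the steeper step is the early one, from 8 to 32 concurrent requests. A finer-grained sweep with more levels would locate the change in slope more precisely.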
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Measuring how batch interference affects tail latency under load | ✅ | |
| Single-user accuracy or output quality benchmarks | | ❌ |
| Establishing P99 baselines before production sizing decisions | ✅ | |
| Comparing two models on output quality for a fixed prompt set | | ❌ |
| Calibrating rate limits and SLA thresholds for an LLM gateway | ✅ | |
| Testing a stub or cached-response endpoint without a real model | | ❌ |
Common Misconception
Myth: Higher concurrency always improves GPU efficiency by packing more work into each compute cycle.
Reality: Beyond the point where batch interference sets in, adding concurrent requests extends queue wait times and drives up tail latency. GPU utilization may stay high, but effective throughput (requests completed within SLA) drops. The efficiency gain saturates while the latency cost keeps growing.
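The gap between raw and effective throughput can be computed directly from per-request latencies. The sketch below uses hypothetical helper names (`raw_throughput`, `effective_throughput`) and made-up sample latencies, and assumes the SLA is a simple upper bound on request latency.

```python
def raw_throughput(latencies_s, window_s):
    """Completed requests per second over the test window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    return len(latencies_s) / window_s


def effective_throughput(latencies_s, sla_s, window_s):
    """Requests per second that completed within the SLA latency bound."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    within_sla = sum(1 for latency in latencies_s if latency <= sla_s)
    return within_sla / window_s


# Made-up sample: five requests completed in a 10-second window, 0.5 s SLA.
sample_latencies_s = [0.2, 0.3, 0.4, 1.5, 2.0]

raw = raw_throughput(sample_latencies_s, window_s=10)
effective = effective_throughput(sample_latencies_s, sla_s=0.5, window_s=10)
print(f"raw: {raw} req/s, within SLA: {effective} req/s")
```

In this sample the server completes 0.5 requests per second, but only 0.3 per second meet the SLA. Reporting both numbers at each concurrency level shows where extra concurrency stops producing useful work.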
One Sentence to Remember
Concurrency is the first variable to set in any LLM load test — not because higher is better, but because its non-linear effect on tail latency defines the difference between a benchmark and a production readiness check.
FAQ
Q: What is request concurrency in LLM load testing? A: The number of simultaneous inference requests sent to an LLM server during a test. Setting it too low produces optimistic results; setting it to match expected production traffic reveals how batch interference shapes tail latency.
Q: Why does higher concurrency spike latency in LLM inference? A: Long-prompt requests monopolize GPU prefill capacity while queued requests wait. As more requests run simultaneously, the probability that a given request lands behind a long prefill increases, so P99 latency climbs steeply as concurrency rises.
Q: How do I choose the right concurrency level for a load test? A: Start low and increase in steps, recording latency at each level. Your target concurrency should reflect peak traffic, not average traffic — SLAs hold or break at peaks, not means.
Sources
- Spheron Blog: LLM Inference SLO Engineering: TTFT, ITL, and P99 Latency Budgets for Production AI (2026) — P99 TTFT measurements across concurrency levels, batch interference analysis
- AIPerf GitHub: AIPerf: Comprehensive benchmarking tool for generative AI inference — concurrency flag documentation and benchmarking methodology
Expert Takes
Batch interference is not a scheduling flaw — it is a consequence of how autoregressive generation works. Prefill is a dense matrix operation whose cost scales with sequence length; decode is a memory-bandwidth-bound process that generates one token per step. When concurrent requests mix long and short prompts, the GPU serializes prefill operations and the tail of the latency distribution stretches. The interference is deterministic, not stochastic — long prompts will always delay queued requests.
Most teams run a single-point benchmark and call it load testing. The more useful output is a concurrency sweep: start low, step up incrementally, record tail latency at each level. The resulting curve shows the inflection point where batch interference begins dominating, and that inflection point is what your SLA and rate limiter should be built around — not the latency from a single comfortable test run.
Every team that has shipped an LLM feature has seen it: demo runs fast, production runs slow. The production environment handles many concurrent users; the demo handled one. Concurrency testing is not optional — it is how you find out whether the model you approved in a single-request eval can handle the traffic your product will send. The test that skips this step is not a test; it is a story you tell yourself before deployment.
The concurrency level you choose for a benchmark determines the result you get. Test at low concurrency and the model looks fast and reliable. Test at realistic production concurrency and it may fall short of its advertised SLA. This is not deception, exactly — but it is a choice, and it is rarely made transparently. Load test reports without a stated concurrency level are not verifiable. What was measured, and under what conditions, matters as much as the number itself.