Inter Token Latency

Also known as: ITL, inter-token delay, token-to-token latency

Inter-token latency (ITL) is the time that elapses between successive tokens in a streaming LLM response, measured in milliseconds per token. For a single stream, the mean ITL is the reciprocal of that stream’s output tokens-per-second rate. ITL determines how smooth or choppy a streamed response feels to the user and is a direct indicator of streaming consistency under load.

What It Is

When an LLM generates a response, it produces tokens one at a time and streams them to the client. Inter-token latency (ITL) measures the gap between one token’s arrival and the next. A low ITL means text appears at a steady, readable pace. A high ITL means the stream stutters: tokens arrive in bursts or with noticeable pauses between words and word fragments.

Think of it like reading subtitles on a screen. If captions appear at a smooth, even rhythm, you follow along naturally. If they freeze and then dump several words at once, your reading rhythm breaks. ITL is the equivalent measurement for streamed AI text — not how fast the response begins, but how evenly it flows once it starts.

ITL and tokens-per-second (TPS) are two views of the same per-request rate. If a single stream delivers 20 tokens per second, its average ITL is 50 milliseconds per token. But TPS is an aggregate; the ITL distribution reveals the variance within that average. A stream averaging 20 TPS might still have ITL spikes of several hundred milliseconds between individual tokens, variation that the aggregate TPS number hides entirely. The inverse relationship holds per stream only: system-wide TPS summed across concurrent requests can rise while each user’s ITL gets worse.
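As a rough illustration of the arithmetic above, the sketch below converts a throughput figure into a mean ITL and then compares two made-up gap sequences that share that mean. The function names and sample values are assumptions chosen for illustration, not measurements.

```python
def mean_itl_ms(tokens_per_second: float) -> float:
    """Mean inter-token latency (ms) for a single stream at the given rate."""
    return 1000.0 / tokens_per_second


def summarize(gaps_ms):
    """Mean and worst gap for a list of inter-token gaps in milliseconds."""
    return {"mean_ms": sum(gaps_ms) / len(gaps_ms), "max_ms": max(gaps_ms)}


# Two hypothetical streams, each with 10 gaps averaging 50 ms (20 TPS).
steady = [50.0] * 10
spiky = [20.0] * 9 + [320.0]  # nine fast gaps followed by one long stall

print(mean_itl_ms(20))    # 50.0
print(summarize(steady))  # {'mean_ms': 50.0, 'max_ms': 50.0}
print(summarize(spiky))   # {'mean_ms': 50.0, 'max_ms': 320.0}
```

Both sequences report the same average, so only the maximum (or a tail percentile) separates the smooth stream from the one with a visible stall.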

In LLM load testing, ITL typically degrades as concurrent request volume increases. A model serving one user at a time might sustain a low, steady ITL. Under many simultaneous requests, GPU resources are shared across those requests, and the same model can see its ITL climb significantly. The text still arrives, but the experience of reading it in real time deteriorates. This is why LLM load testing tracks ITL across increasing concurrency levels, not just at idle.
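One way to structure a sweep across concurrency levels is sketched below. `fake_stream` is a hypothetical stand-in for a real streaming client call; it does not model GPU contention, so its gaps will not grow with load. Only the shape of the harness is the point here.

```python
import asyncio
import time


async def fake_stream(n_tokens: int, delay_s: float):
    """Hypothetical stand-in for a streaming LLM call; yields n_tokens strings."""
    for i in range(n_tokens):
        await asyncio.sleep(delay_s)
        yield f"tok{i}"


async def collect_gaps(stream) -> list:
    """Consume one stream and return the gaps (ms) between token arrivals."""
    gaps, prev = [], None
    async for _ in stream:
        now = time.perf_counter()
        if prev is not None:
            gaps.append((now - prev) * 1000.0)
        prev = now
    return gaps


async def sweep(make_stream, levels):
    """For each concurrency level, run that many streams at once and pool their gaps."""
    results = {}
    for level in levels:
        per_request = await asyncio.gather(
            *(collect_gaps(make_stream()) for _ in range(level))
        )
        results[level] = [g for gaps in per_request for g in gaps]
    return results


def run_demo():
    """Sweep the fake stream at 1, 2, and 4 concurrent requests."""
    return asyncio.run(sweep(lambda: fake_stream(5, 0.001), levels=[1, 2, 4]))
```

Swapping `fake_stream` for a call to a real endpoint would let the pooled gaps per level be compared directly, for example by looking at how the tail moves as the level increases.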

ITL is distinct from time-to-first-token (TTFT). TTFT captures how long the user waits before anything appears on screen: any queuing delay plus the prefill stage, where the model processes the prompt. ITL begins after that first token arrives and describes the rhythm of the rest of the response. A system can have excellent TTFT (fast prompt processing) but poor ITL if it batches tokens internally and releases them in groups rather than one at a time.
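The split between the two metrics can be made concrete with a small sketch. It assumes the client has recorded the time the request was sent and the arrival time of every token, all in seconds on the same clock; the function name and sample timestamps are illustrative.

```python
def split_latency(request_sent_s, arrivals_s):
    """Return (ttft_ms, itl_ms_list) from a send time and token arrival times."""
    if not arrivals_s:
        raise ValueError("no tokens arrived")
    # TTFT: wait from sending the request until the first token arrives.
    ttft_ms = (arrivals_s[0] - request_sent_s) * 1000.0
    # ITL: gap between each later token and the token before it.
    itl_ms = [(b - a) * 1000.0 for a, b in zip(arrivals_s, arrivals_s[1:])]
    return ttft_ms, itl_ms


# Hypothetical stream: quick first token, then two tokens released together
# and a long pause before the last one.
ttft, itl = split_latency(0.0, [0.25, 0.5, 0.5, 1.0])
print(ttft)  # 250.0
print(itl)   # [250.0, 0.0, 500.0]
```

A response with N tokens yields one TTFT value and N - 1 ITL values, which is why the two are reported as separate metrics.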

For interactive applications — chat interfaces, coding assistants, document editors with AI suggestions — ITL is often more noticeable than TTFT. Once text starts flowing, users expect a consistent pace. Pauses between tokens mid-sentence break the reading experience even when the first token arrived quickly.

How It’s Used in Practice

In LLM load testing, ITL is measured alongside TTFT and p99 latency to characterize how a model behaves under concurrent load. A benchmark run fixes the number of simultaneous requests, generates a target number of output tokens, and records the time between each token arrival at the client. Engineers then summarize the ITL distribution: p50 (median), p90, and p99. The p99 ITL reveals worst-case streaming behavior — what the slowest one percent of inter-token gaps look like when the system is under stress.
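A minimal sketch of that summary step follows, using the nearest-rank definition of a percentile. Benchmarking tools may interpolate differently, so treat the exact method as an assumption; the sample data is made up.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_itl(gaps_ms):
    """p50, p90, and p99 of a list of inter-token gaps in milliseconds."""
    return {f"p{p}": percentile(gaps_ms, p) for p in (50, 90, 99)}


# Hypothetical run: 98 steady gaps of 40 ms plus two stalls.
gaps = [40] * 98 + [400, 900]
print(summarize_itl(gaps))  # {'p50': 40, 'p90': 40, 'p99': 400}
```

In this made-up sample the median and p90 look healthy, and only the p99 exposes the stalls.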

Observability tools that support streaming LLM calls capture ITL per request and aggregate it over time. This lets teams set alerts when median ITL exceeds a defined ceiling or when p99 ITL spikes during traffic surges. Tools like GenAI-Perf (part of the Triton inference ecosystem) and similar LLM benchmarking frameworks expose ITL as a first-class metric alongside TTFT, so it can be tracked across model providers, deployment configurations, and concurrency levels.
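The alerting logic described above might look roughly like the following. It is a sketch that uses only the Python standard library rather than any particular observability tool’s API, and the ceiling values in the example are hypothetical.

```python
import statistics


def itl_alerts(gaps_ms, median_ceiling_ms, p99_ceiling_ms):
    """Return alert messages for one aggregation window of ITL samples (needs >= 2 samples)."""
    alerts = []
    median = statistics.median(gaps_ms)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(gaps_ms, n=100, method="inclusive")[98]
    if median > median_ceiling_ms:
        alerts.append(f"median ITL {median:.1f} ms exceeds {median_ceiling_ms} ms")
    if p99 > p99_ceiling_ms:
        alerts.append(f"p99 ITL {p99:.1f} ms exceeds {p99_ceiling_ms} ms")
    return alerts


# Hypothetical window: mostly 40 ms gaps with two stalls during a surge.
window = [40.0] * 98 + [400.0, 900.0]
print(itl_alerts(window, median_ceiling_ms=60, p99_ceiling_ms=200))
```

Checking the median and the tail separately means a surge that only affects a few gaps still raises an alert even though the typical gap is unchanged.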

Pro Tip: When benchmarking, don’t report only average TPS. Log the full ITL distribution. A system with a healthy average TPS can still feel choppy if its p99 ITL is high — that tail behavior is what users notice, not the mean.

When to Use / When Not

Scenario	Use	Avoid
Benchmarking a streaming chat application under load	✅
Evaluating a batch document processing pipeline with no streaming		❌
Diagnosing why a stream feels slow despite acceptable TTFT	✅
Comparing two model APIs for non-streaming, synchronous endpoints		❌
Setting SLA thresholds for a real-time coding assistant	✅
Measuring throughput of an offline embedding or classification pipeline		❌

Common Misconception

Myth: ITL and TTFT measure the same thing, so tracking one is enough.

Reality: TTFT measures how long before the first token arrives — it captures prompt processing latency. ITL measures the gaps between every subsequent token after that first one. A system optimized for TTFT can still produce a choppy stream if it generates tokens in bursts. Both metrics are necessary to fully describe the streaming experience, and they respond differently to load: TTFT often degrades due to queuing, while ITL degrades due to GPU compute sharing during generation.

One Sentence to Remember

ITL tells you how steady the stream is, not just how fast it starts — and that steadiness is what users actually feel during a real-time interaction.

FAQ

Q: How is inter-token latency calculated? A: Record the timestamp of each token as it arrives at the client. The ITL for each token is the difference between its arrival time and the previous token’s arrival time. Average or percentile values summarize the distribution across a full response or a load test run.
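A client-side measurement along those lines could be sketched as follows. The function wraps any iterable of tokens; the `clock` parameter is an assumption added here so the example can be driven by a fake clock.

```python
import time


def measure_itl_ms(token_stream, clock=time.perf_counter):
    """Consume a token stream and return the gap (ms) before each token after the first."""
    gaps_ms = []
    prev = None
    for _token in token_stream:
        now = clock()  # timestamp taken when the token reaches the client
        if prev is not None:
            gaps_ms.append((now - prev) * 1000.0)
        prev = now
    return gaps_ms


# Deterministic demo: a fake clock that returns preset arrival times in seconds.
fake_clock = iter([0.0, 0.5, 0.75, 1.75]).__next__
print(measure_itl_ms(["The", " cat", " sat", "."], clock=fake_clock))
# [500.0, 250.0, 1000.0]
```

With the default clock, the same function can be pointed at a real streaming iterator and its output fed into whatever percentile summary the benchmark uses.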

Q: What is the difference between inter-token latency and time-to-first-token? A: TTFT measures the wait from sending a request until the first token arrives — it reflects prompt processing speed. ITL measures the gaps between each subsequent token. TTFT reveals how fast the model starts; ITL reveals how smoothly it continues.

Q: Does a higher tokens-per-second rate always mean lower inter-token latency? A: On average, yes — they are mathematical inverses. But TPS is an aggregate. A model with high average TPS can still have high ITL variance if tokens are processed in micro-batches and released unevenly. Measuring the ITL distribution catches this behavior; average TPS alone does not.
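The micro-batching effect can be seen in a toy model. It assumes tokens are generated at a fixed interval but held back and flushed to the client in groups; the function names and numbers are illustrative only.

```python
def release_times_ms(n_tokens, step_ms, batch_size):
    """Client arrival times when tokens generated every step_ms are flushed in groups."""
    times = []
    for i in range(n_tokens):
        # A token is held until the last token of its group has been generated.
        group_end = (i // batch_size + 1) * batch_size
        times.append(min(group_end, n_tokens) * step_ms)
    return times


def gaps(times):
    """Differences between consecutive arrival times."""
    return [b - a for a, b in zip(times, times[1:])]


smooth = release_times_ms(8, 50, batch_size=1)
bursty = release_times_ms(8, 50, batch_size=4)

print(smooth[-1], bursty[-1])  # 400 400
print(gaps(smooth))            # [50, 50, 50, 50, 50, 50, 50]
print(gaps(bursty))            # [0, 0, 0, 200, 0, 0, 0]
```

Both streams deliver eight tokens in 400 ms, so their end-to-end TPS matches, yet the bursty stream shows a 200 ms gap that the smooth one never does.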

Expert Takes

MONA

ITL exposes the autoregressive generation loop directly. Each token is produced sequentially, conditioned on all prior tokens. Variation in ITL reflects two main sources. The first is context growth: each decode step attends over a longer sequence as the response lengthens, although the cost of a step does not depend on which token is sampled. The second is hardware pressure: at high concurrency, the GPU’s compute is split across requests, so each forward pass takes longer to complete. ITL doesn’t describe the model in isolation; it describes the model under load.

MAX

In any streaming API integration, treat ITL as a named resource in your SLA, not an afterthought. Set a p99 ITL threshold alongside your TTFT target and monitor both separately. An ITL spike mid-response breaks user flow in ways a slow TTFT does not — users wait for a response to start, but disengage when text hesitates mid-sentence. Log ITL per session to catch user-facing degradation that aggregate metrics miss.

DAN

Teams shipping production AI products who don’t track ITL are flying blind. TTFT is the headline number that looks good in demos; ITL is what users complain about in production. Load-test at three to five times your expected peak concurrency, record the tail ITL, and build that number into your capacity planning. The cost of discovering your streaming UX degrades at scale is always higher after launch than before it.

ALAN

ITL thresholds are design choices that encode assumptions about whose attention is worth optimizing for. A tight ITL target implies a user reading text in real time — a specific interaction model. Batch pipelines, accessibility tools converting text to speech, or applications serving users with different cognitive processing speeds may need entirely different ITL tolerances. Before treating any ITL benchmark as a universal standard, ask whether the benchmark was built for the interaction context you are actually designing for.
