Tokens Per Second

Also known as: TPS, output throughput, generation rate

Tokens per second (TPS) is the rate at which an LLM generates output tokens during inference, measured after the first token arrives. It is the primary throughput metric for evaluating streaming responsiveness and batch processing capacity under real-world load.

What It Is

TPS answers a specific question: once a model starts responding, how quickly does text arrive? That distinction matters for anyone building on top of language models. A slow-starting model might still produce 80 tokens per second once it gets going — or a fast-starting model might slow to a crawl after the first sentence. TPS captures that sustained generation speed, which is why it sits alongside TTFT and p99 latency as a core measurement in LLM load testing.

Think of TPS like reading words off a teleprompter. TTFT (time to first token) is how long before the prompter starts scrolling. TPS is how fast the words move once it does. Both affect the person reading, but they describe entirely different parts of the experience.
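To make the TTFT/TPS split concrete, here is a minimal sketch that derives both numbers from token arrival timestamps. The function name and the timestamps are illustrative assumptions, not part of any particular benchmarking tool, and some tools define TPS slightly differently (for example, including the first token in the count).

```python
def ttft_and_tps(request_start, token_times):
    """Return (ttft_seconds, tps) from a request start time and token arrival times.

    TPS is computed as the decode rate after the first token: the tokens that
    follow the first one, divided by the time between first and last token.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    decode_time = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_time <= 0:
        return ttft, 0.0  # not enough data to compute a rate
    return ttft, (len(token_times) - 1) / decode_time


# Hypothetical stream: first token at 0.5 s, then one token every 20 ms.
arrivals = [0.5 + 0.02 * i for i in range(101)]
ttft, tps = ttft_and_tps(0.0, arrivals)
print(f"TTFT = {ttft:.2f} s, TPS = {tps:.1f}")
```

Keeping the first token out of the rate keeps the two metrics independent: a long prompt-processing delay shows up in TTFT only and does not drag down the reported TPS.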

The mechanics behind TPS trace back to how transformer models generate output: one token at a time, in a sequential loop called autoregressive decoding. Each token requires the model to attend over the entire context accumulated so far — the prompt plus every token generated since. As the response grows longer, that attention computation grows too. This is why TPS can drop for very long responses: later tokens cost more to produce than earlier ones.

Three factors drive TPS on the infrastructure side. First, memory bandwidth — moving model weights and the KV cache (the stored attention states from all previous tokens) between GPU memory and compute units is often the bottleneck, not the floating-point operations themselves. Second, batch size — when an inference server groups multiple requests and processes them in parallel, total throughput goes up even if per-request TPS stays similar. Third, quantization — compressing model weights to lower precision reduces memory bandwidth requirements and typically increases TPS, with some tradeoff in output quality.
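The memory-bandwidth point can be illustrated with a back-of-the-envelope ceiling. The sketch below assumes a simplified model in which every generated token requires reading all weights once and KV cache traffic is ignored; the model size and bandwidth figures are hypothetical, and real single-request TPS will land below this bound.

```python
def decode_tps_upper_bound(params_billion, bytes_per_param, bandwidth_gb_s):
    """Rough ceiling on single-request TPS when decoding is memory-bound.

    Assumption: each token reads every weight once; KV cache reads,
    kernel overheads, and batching effects are ignored.
    """
    weight_gb = params_billion * bytes_per_param  # GB of weights to stream per token
    return bandwidth_gb_s / weight_gb


# Hypothetical 7B-parameter model on hardware with 1000 GB/s memory bandwidth.
fp16_bound = decode_tps_upper_bound(7, 2.0, 1000)   # 16-bit weights
int4_bound = decode_tps_upper_bound(7, 0.5, 1000)   # 4-bit quantized weights
print(f"fp16 ceiling: {fp16_bound:.0f} TPS, int4 ceiling: {int4_bound:.0f} TPS")
```

Under these assumptions, cutting bytes per parameter by a factor of four raises the ceiling by the same factor, which is the intuition behind quantization improving TPS.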

Under load, TPS behaves non-linearly. At low concurrency, a single request often gets the full GPU’s attention and achieves the highest per-request TPS. As concurrency increases, the inference server batches requests, so total tokens per second across all requests goes up, but any individual user sees their personal TPS flatten or drop. This gap between aggregate throughput and per-user throughput is what makes TPS a tricky metric to communicate: a benchmark showing “200 TPS” could mean 200 tokens per second for one isolated request, or 200 tokens per second in aggregate across 50 concurrent connections, which is only about 4 tokens per second per user.
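The two readings of a headline number can be separated with simple arithmetic. This sketch assumes throughput is shared evenly across concurrent requests, which real schedulers only approximate; the function name is illustrative.

```python
def per_user_tps(aggregate_tps, concurrency):
    """Average per-request TPS, assuming aggregate throughput is shared evenly."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return aggregate_tps / concurrency


# The same "200 TPS" headline under two interpretations.
isolated = per_user_tps(200, 1)    # one request gets all 200 tokens/s
shared = per_user_tps(200, 50)     # 200 tokens/s split across 50 connections
print(f"isolated: {isolated:.0f} TPS per user, shared: {shared:.0f} TPS per user")
```

When reading a benchmark report, checking which of these two quantities is being quoted, and at what concurrency, is usually the first step.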

How It’s Used in Practice

The most common encounter with TPS is during load testing before shipping an AI feature. A team building a chatbot or code assistant runs a battery of concurrent requests against their inference endpoint — whether a hosted API or a self-managed inference server — and measures how TPS holds up as the number of simultaneous users increases. The goal is to find the point where TPS degrades enough to hurt the user experience before the feature goes live.

Tools like GenAI Perf, locust-based test harnesses, or vendor-provided benchmarking scripts typically report TPS alongside TTFT and p99 latency. The combination gives a fuller picture: TTFT tells you about perceived responsiveness, TPS tells you about sustained streaming quality, and p99 latency tells you about the worst-case experience under load.

TPS also appears in cost calculations. Most hosted LLM APIs charge per output token. Knowing the average TPS of a model, combined with expected session lengths and traffic volume, lets a team estimate monthly API costs before committing to a provider.
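A rough estimate of this kind might look like the sketch below. All inputs (traffic, tokens per session, price, and TPS) are hypothetical placeholders. Note that per-token billing depends on token volume; TPS enters when converting that volume into how long each response takes to stream.

```python
def monthly_output_cost(sessions_per_day, output_tokens_per_session,
                        price_per_million_tokens, days=30):
    """Estimated monthly spend on output tokens under per-token pricing."""
    total_tokens = sessions_per_day * output_tokens_per_session * days
    return total_tokens / 1_000_000 * price_per_million_tokens


def generation_seconds(output_tokens, tps):
    """Time to stream a response of the given length at a sustained TPS."""
    return output_tokens / tps


# Hypothetical workload: 10,000 sessions/day, 800 output tokens each,
# priced at 10 currency units per million output tokens, streamed at 40 TPS.
cost = monthly_output_cost(10_000, 800, 10.0)
seconds = generation_seconds(800, 40)
print(f"monthly cost: {cost:.0f}, time per session: {seconds:.0f} s")
```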

Pro Tip: Never benchmark TPS at concurrency 1 and treat it as representative. Run tests at 10, 25, and 50 concurrent requests and watch where TPS per user starts dropping. That degradation curve — not the peak headline number — is what determines whether your infrastructure holds under real traffic.
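One way to read a degradation curve is to find the highest tested concurrency at which per-user TPS still meets a target. The measurements and the 30 TPS threshold below are made-up illustrative values, and the helper names are assumptions rather than features of any load-testing tool.

```python
def highest_sustainable_concurrency(per_user_tps, min_tps):
    """Return the highest tested level before per-user TPS first drops below min_tps.

    per_user_tps maps concurrency level -> measured per-user TPS.
    Returns None if even the lowest level misses the target.
    """
    best = None
    for level in sorted(per_user_tps):
        if per_user_tps[level] < min_tps:
            break
        best = level
    return best


def drop_vs_baseline(per_user_tps):
    """Percentage drop at each level relative to the lowest tested concurrency."""
    baseline = per_user_tps[min(per_user_tps)]
    return {c: 100 * (1 - tps / baseline) for c, tps in per_user_tps.items()}


# Hypothetical load-test results: concurrency -> per-user TPS.
measured = {1: 80.0, 10: 62.0, 25: 41.0, 50: 22.0}
limit = highest_sustainable_concurrency(measured, min_tps=30.0)
drops = drop_vs_baseline(measured)
print(f"sustainable up to {limit} concurrent requests; drop at 50: {drops[50]:.1f}%")
```

Stopping at the first failing level, rather than taking the maximum passing level, avoids reporting a concurrency that only passed because of a noisy measurement further up the curve.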

When to Use / When Not

Scenario	Use	Avoid
Streaming chat interface where response delay hurts UX	✅
Comparing inference providers before committing	✅
Assessing reasoning quality between two models		❌
Measuring first-response delay for interactive prompts		❌
Estimating infrastructure capacity for batch generation jobs	✅
Choosing between models on output accuracy grounds		❌

Common Misconception

Myth: A model with higher TPS is a better model.

Reality: TPS measures how fast tokens are produced, not how useful those tokens are. A smaller, heavily quantized model might generate twice as many tokens per second as a larger one while producing less accurate or less coherent output. TPS is a throughput metric, not a quality signal. Benchmark it for performance requirements, not to rank model capability.

One Sentence to Remember

TPS tells you how fast an LLM writes, not how well it thinks — so measure it at the concurrency levels your actual users will create, not at the single-request peak your vendor publishes.

FAQ

Q: What counts as a good tokens per second rate for an LLM API? A: It depends on the use case. Streaming chat generally needs roughly 30 TPS or more per user to avoid feeling sluggish. Batch processing tolerates lower per-request rates if total throughput is sufficient. Always compare at realistic concurrency, not peak single-request speed.

Q: How is tokens per second different from TTFT? A: TTFT (time to first token) measures the delay before any output arrives. TPS measures the generation rate once output starts flowing. A model can have low TTFT but slow sustained generation, or high TTFT followed by fast output — both metrics matter for different aspects of perceived speed.

Q: Does tokens per second decrease when many users send requests at the same time? A: Per-user TPS typically decreases as concurrency rises, because the inference server’s GPU resources are shared across more requests. Total system throughput may increase, but each individual request gets a smaller share of compute. Load testing at realistic concurrency levels reveals this degradation before it reaches production.

Expert Takes

MONA

TPS is bounded by memory bandwidth, not floating-point throughput. Modern GPU accelerators move weights and KV cache data faster than they ever could in previous generations, yet inference remains memory-bound for most transformer architectures. The practical implication is that quantization gains are real and not just a compromise — reducing precision cuts the data moved per token, which directly raises TPS without proportional quality loss, until you push quantization far enough that weight precision becomes the limiting factor.

MAX

TPS is the metric that determines whether your LLM gateway can sustain streaming to real users under load. In a multi-model routing setup, you route high-concurrency, latency-sensitive traffic to models with better TPS characteristics and reserve slower, higher-quality models for batch or low-concurrency workloads. The gateway’s job is to know each model’s TPS curve — how it degrades as concurrency rises — and route accordingly, not just pick the cheapest or most accurate option in isolation.

DAN

Teams shopping for inference providers focus on quality benchmarks and ignore TPS until they hit a wall in production. By then, switching is expensive. TPS should be part of the vendor evaluation scorecard from the start — measured at your expected peak concurrency, not the provider’s headline number. The providers who publish honest TPS-under-load data are telling you something about how they think about production. The ones who don’t are also telling you something.

ALAN

TPS benchmarks are easy to manipulate. Running one request on dedicated hardware and calling the result representative misrepresents what users experience in shared, high-concurrency production environments. The gap between benchmark TPS and production TPS is rarely published and almost never disclosed in vendor comparisons. When a team makes infrastructure decisions based on headline throughput figures, they optimize for a scenario that does not exist. The number that matters comes from your own testing at your actual load.
