GenAI-Perf
Also known as: NVIDIA GenAI-Perf, genai-perf CLI, GenAI Performance Analyzer
- GenAI-Perf was NVIDIA’s open-source CLI tool for benchmarking LLM inference performance, measuring time to first token, inter-token latency, tokens per second, and requests per second against any OpenAI-compatible endpoint. It was deprecated in mid-2026 in favor of NVIDIA AIPerf.
GenAI-Perf was NVIDIA’s open-source CLI tool for measuring the streaming-specific performance of LLM inference APIs — tracking time to first token, inter-token latency, tokens per second, and throughput across any OpenAI-compatible endpoint.
What It Is
Standard HTTP load testing tools were designed for request-response APIs: send a request, wait for a response, record how long it took. LLM inference breaks this model. Responses arrive as a stream of tokens, not a single payload, so the latency profile has two distinct phases — the time before the first token appears (time to first token, or TTFT) and the pace at which tokens arrive after that (inter-token latency, or ITL). An API that looks fast on average response time can still feel broken if TTFT is high or if tokens trickle in with visible gaps. Standard tools miss both signals.
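The two phases can be made concrete with a minimal Python sketch that derives TTFT and mean ITL from token arrival timestamps. The `StreamTiming` type, the `stream_timing` helper, and the sample timestamps are hypothetical illustrations, not part of GenAI-Perf or AIPerf.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTiming:
    ttft: float           # seconds from request send to first token
    mean_itl: float       # mean gap between consecutive tokens, in seconds
    total_latency: float  # seconds from request send to last token


def stream_timing(request_sent: float, token_times: List[float]) -> StreamTiming:
    """Split one streamed response into its two latency phases."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_sent
    # Gaps between consecutive token arrivals; empty for a one-token response.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return StreamTiming(ttft, mean_itl, token_times[-1] - request_sent)


# Two made-up responses that both finish 2.0 s after the request was sent.
slow_start = stream_timing(0.0, [1.6, 1.7, 1.8, 1.9, 2.0])
slow_stream = stream_timing(0.0, [0.2, 0.65, 1.1, 1.55, 2.0])

print(slow_start)
print(slow_stream)
```

Both responses have the same total latency, yet the first makes the user wait 1.6 s for any output while the second starts quickly and then delivers tokens with 0.45 s gaps. A tool that records only total response time cannot tell them apart.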
GenAI-Perf was built to address that measurement gap. Think of it as a race clock that measures not just the finish time but also the first lap split and the pace of every lap after that, each a distinct signal that reveals a different bottleneck. As an open-source CLI from NVIDIA, it connected to any OpenAI-compatible LLM endpoint, ran configurable concurrent load profiles, and reported TTFT and ITL for each request alongside tokens per second (TPS) and requests per second (RPS) for the run as a whole. This gave infrastructure teams the streaming-specific visibility that general HTTP load testing tools cannot provide: the granularity that separates “the API responded” from “the model actually served the user without delay.”
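As a rough illustration of the two throughput metrics, the sketch below computes TPS and RPS over a run window, assuming the common definition of total output tokens (or completed requests) divided by the wall-clock span of the run. The `throughput` helper and the sample records are hypothetical, and the tool's own definitions may differ in detail.

```python
from typing import List, Tuple

# One record per completed request: (start_s, end_s, output_tokens).
RequestRecord = Tuple[float, float, int]


def throughput(requests: List[RequestRecord]) -> Tuple[float, float]:
    """Return (tokens_per_second, requests_per_second) over the run window."""
    if not requests:
        raise ValueError("need at least one completed request")
    run_start = min(start for start, _, _ in requests)
    run_end = max(end for _, end, _ in requests)
    window = run_end - run_start
    if window <= 0:
        raise ValueError("run window must be positive")
    total_tokens = sum(tokens for _, _, tokens in requests)
    return total_tokens / window, len(requests) / window


# Made-up run: three overlapping requests spanning 4 seconds in total.
run = [(0.0, 2.0, 100), (0.0, 3.0, 150), (1.0, 4.0, 150)]
tps, rps = throughput(run)
print(f"TPS={tps:.1f} RPS={rps:.2f}")
```

Unlike TTFT and ITL, these two numbers describe the run as a whole, so they are best read next to the per-request latency distributions rather than in place of them.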
Under the hood, GenAI-Perf acted as a client-side harness. It issued simultaneous requests to the target endpoint, tracked the token stream for each one, and assembled percentile distributions — p50, p95, p99 — of TTFT and ITL across the full run. The distribution matters more than the average: load testing LLMs is fundamentally about finding tail behavior — what degrades first as concurrency climbs, and how far it degrades before hitting a hard limit.
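To show why the distribution matters more than the average, here is a small Python sketch that summarizes a set of TTFT samples with nearest-rank percentiles. The nearest-rank method is one common convention chosen for illustration (the source does not say which method the tool uses), and the sample values are invented.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample that at least pct% of samples do not exceed."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Report the three percentiles named in the text."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Made-up TTFT samples in milliseconds: 18 healthy requests and 2 stragglers.
ttft_ms = [110, 95, 102, 98, 105, 99, 101, 97, 103, 100,
           104, 96, 108, 94, 107, 93, 106, 92, 450, 900]

summary = summarize(ttft_ms)
mean_ms = sum(ttft_ms) / len(ttft_ms)
print(summary, mean_ms)
```

In this invented run the median is 101 ms and the mean is 158 ms, but p95 is 450 ms and p99 is 900 ms. The average hides the two requests that would feel broken to a user.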
As of mid-2026, GenAI-Perf is officially deprecated. According to NVIDIA Triton Docs, no new features are being developed for it. NVIDIA directs teams to NVIDIA AIPerf, maintained under the ai-dynamo organization on GitHub, which reached v0.10.0 in June 2026, according to AIPerf GitHub. The measurement concepts GenAI-Perf established — streaming-aware latency measurement, per-metric percentile reporting, OpenAI-compatible test targets — are carried forward in AIPerf.
How It’s Used in Practice
The primary use case was evaluating inference infrastructure before go-live. A platform or ML engineering team would run GenAI-Perf against their LLM serving stack — Triton Inference Server, vLLM, or a cloud endpoint — at increasing concurrency levels: 10, 50, 100 simultaneous requests. The tool tracked where TTFT and ITL began to degrade as load climbed, answering the critical pre-production question: at what request volume does the model start making users wait?
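A concurrency sweep of this kind can be sketched in Python with `asyncio`. Everything here is a simulation under stated assumptions: `fake_stream` stands in for a real streaming endpoint, the two "server slots" and the timing constants are invented, and the levels are kept small (2, 4, 8 rather than 10, 50, 100) so the example runs in well under a second. It is not how GenAI-Perf or AIPerf is implemented.

```python
import asyncio
import time
from typing import AsyncIterator, Dict, List

PREFILL_S = 0.02   # simulated delay before the first token
DECODE_S = 0.002   # simulated gap between later tokens
TOKENS = 5         # tokens per simulated response
SERVER_SLOTS = 2   # simulated server capacity (requests served at once)


async def fake_stream(slots: asyncio.Semaphore) -> AsyncIterator[str]:
    """Stand-in for a streaming endpoint; requests queue when all slots are busy."""
    async with slots:
        await asyncio.sleep(PREFILL_S)
        for i in range(TOKENS):
            if i:
                await asyncio.sleep(DECODE_S)
            yield f"tok{i}"


async def timed_request(slots: asyncio.Semaphore) -> float:
    """Send one simulated request and return its TTFT in seconds."""
    sent = time.perf_counter()
    ttft = None
    async for _token in fake_stream(slots):
        if ttft is None:
            ttft = time.perf_counter() - sent  # first token arrived
    return ttft


async def run_level(concurrency: int) -> List[float]:
    """Issue `concurrency` simultaneous requests against a fresh simulated server."""
    slots = asyncio.Semaphore(SERVER_SLOTS)
    return await asyncio.gather(*(timed_request(slots) for _ in range(concurrency)))


async def sweep(levels: List[int]) -> Dict[int, float]:
    """Worst-case TTFT observed at each concurrency level."""
    results = {}
    for level in levels:
        results[level] = max(await run_level(level))
    return results


results = asyncio.run(sweep([2, 4, 8]))
for level, worst in results.items():
    print(f"concurrency={level:>2}  worst TTFT={worst * 1000:.0f} ms")
```

Once concurrency exceeds the simulated capacity, requests queue before their first token and worst-case TTFT climbs with each level. That rising curve is the signal a pre-production sweep looks for.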
A second common scenario was comparing inference backend options. When the same model could run on multiple backends or across cloud providers, GenAI-Perf generated side-by-side TTFT and throughput numbers from the same prompt workload — replacing vendor performance claims with measured evidence.
For new projects today, the setup path changes. According to AIPerf GitHub, installation is pip install aiperf, replacing the Triton SDK package path that GenAI-Perf required. The test parameters and metric definitions remain compatible — the migration is a package swap, not a methodology change.
Pro Tip: Set your test prompt length to match your actual production workload. A short-prompt benchmark showing clean TTFT numbers can mask the regression that appears when real long system prompts start filling the KV cache under load.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Starting a new LLM benchmarking project today | | ❌ GenAI-Perf is deprecated; use NVIDIA AIPerf |
| Maintaining existing GenAI-Perf scripts short-term | ✅ Acceptable while planning migration | |
| Measuring TTFT and ITL distribution under concurrent load | ✅ Use AIPerf (successor) | |
| Load testing standard REST or GraphQL APIs | | ❌ Use k6, Locust, or Gatling instead |
| Comparing multiple OpenAI-compatible LLM endpoints | ✅ Use AIPerf | |
| Detecting output quality regression under high load | | ❌ Latency tools do not measure this |
Common Misconception
Myth: GenAI-Perf is still NVIDIA’s recommended tool for LLM load testing.
Reality: According to NVIDIA Triton Docs, GenAI-Perf is deprecated with no new features being added. NVIDIA now directs teams to NVIDIA AIPerf (GitHub: ai-dynamo/aiperf) as the active replacement. Teams starting new benchmarking projects should go directly to AIPerf.
One Sentence to Remember
GenAI-Perf established what LLM-specific load testing looks like — streaming-aware, metric-granular, endpoint-agnostic — and NVIDIA AIPerf carries that design forward as the actively maintained tool.
FAQ
Q: What four metrics did GenAI-Perf measure?
A: GenAI-Perf tracked time to first token (TTFT), inter-token latency (ITL), tokens per second (TPS), and requests per second (RPS). The first three are token-level signals that general load testing tools do not capture.
Q: Is GenAI-Perf still maintained?
A: No. According to NVIDIA Triton Docs, GenAI-Perf is deprecated and no new features are being developed. NVIDIA AIPerf, from the ai-dynamo GitHub organization, is the current replacement.
Q: How does LLM load testing differ from standard API load testing?
A: Standard tools measure total response time. LLM APIs stream tokens, so TTFT and inter-token latency are separate failure modes — two signals that a single response-time average collapses into one misleading number.
Sources
- NVIDIA Triton Docs: GenAI-Perf — NVIDIA Triton Inference Server (deprecation notice) - Official documentation with deprecation notice and migration guidance
- AIPerf GitHub: ai-dynamo/aiperf: Successor to GenAI-Perf - Active NVIDIA repository for the GenAI-Perf successor tool
Expert Takes
GenAI-Perf operationalized a key insight: LLM inference is a streaming process, not a request-response transaction. TTFT and inter-token latency are distinct failure modes — a model can deliver fast TTFT with degraded ITL when token generation rate drops mid-stream. Measuring both separately gave teams visibility into which bottleneck they were actually hitting. That measurement architecture is exactly what AIPerf continues.
The transition from GenAI-Perf to AIPerf is a CLI swap, not a conceptual redesign. Both tools assume the same testing contract: send concurrent requests to an OpenAI-compatible endpoint, capture per-request streaming metrics, report percentile distributions. If you have existing GenAI-Perf scripts, migrating to AIPerf means swapping the package install and CLI name — the test parameters and metric output format follow the same conventions.
Deprecation of an NVIDIA open-source tool matters because NVIDIA controls the hardware stack: when they shift tooling, the inference community follows. GenAI-Perf becoming a legacy tool tells teams two things: LLM benchmarking has matured enough to warrant a purpose-built successor, and the bar for production-grade load testing is rising. Teams still measuring with curl loops are not measuring what actually matters.
GenAI-Perf measured what the tool could observe — time, tokens, concurrency. But the metrics it tracked say nothing about what happens when the model gets overloaded and quietly degrades output quality rather than timing out. Load testing tells you whether your infrastructure holds. It does not tell you whether what the model returns under load is still correct, coherent, or safe. That question has no CLI flag.