Inference

Also known as: model inference, AI inference, LLM inference

Inference
The process of running a trained machine learning model on new input data to produce a prediction or output. For large language models, inference uses autoregressive decoding — generating text one token at a time, with each token conditioned on all preceding tokens.


What It Is

Every AI model goes through two distinct stages in its lifecycle: training and inference. Training is the expensive, time-consuming part where the model learns patterns from massive datasets. Inference is what happens after — the moment you actually use the model to get a result. When you type a question into ChatGPT, Claude, or any AI assistant, the model runs inference to produce your answer.

Think of it like learning to cook versus actually cooking dinner. Training is reading cookbooks, studying techniques, and practicing for months. Inference is walking into the kitchen and making a meal for someone. How good the meal tastes depends on your training, but how fast and cheaply you can serve it depends on your inference setup.

For large language models, inference works through autoregressive decoding. The model doesn’t produce an entire response at once. Instead, it predicts one token (roughly a word or word fragment) at a time, feeding each prediction back as input to generate the next one. This is why you see text appearing word by word in AI chat interfaces — that’s autoregressive inference happening in real time.
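The feedback loop described above can be sketched in a few lines of Python. The "model" here is a toy bigram lookup table standing in for a real LLM's next-token predictor (an illustrative assumption, not how any production model works); the point is the structure: each prediction is appended to the context and fed back in to produce the next one.

```python
# Toy autoregressive decoding loop. The "model" is a hypothetical bigram
# lookup table standing in for a real LLM's next-token predictor.
BIGRAMS = {
    "<s>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "</s>",
}

def next_token(context):
    # A real model runs a forward pass over the whole context;
    # this toy stand-in only looks at the last token.
    return BIGRAMS.get(context[-1], "</s>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)   # predict one token...
        tokens.append(tok)         # ...feed it back as input
        if tok == "</s>":          # stop token ends generation
            break
    return tokens

print(generate(["<s>"]))  # tokens arrive one at a time, like streaming text
```

Because each step depends on the previous one, the loop cannot be parallelized across output tokens, which is exactly why generation time grows with response length.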

According to the BentoML Handbook, LLM inference has two distinct phases: prefill, where the model processes your entire prompt, and decode, where it generates tokens one at a time. The prefill phase is compute-intensive because it processes every prompt token in parallel. The decode phase is memory-intensive because it must store and repeatedly read the attention history, called the KV cache, for every token generated so far.
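A minimal sketch of that two-phase split, using a plain list as a stand-in for the per-layer key/value tensors a real serving engine would keep:

```python
# Sketch of prefill vs. decode. A list stands in for the KV cache; in a
# real engine each entry would be per-layer key/value tensors.

def prefill(prompt_tokens):
    # Prefill: process all prompt tokens in one parallel pass (compute-bound),
    # producing one KV-cache entry per prompt token.
    return [f"kv({t})" for t in prompt_tokens]

def decode_step(kv_cache, new_token):
    # Decode: each step attends over the *entire* cache so far (memory-bound),
    # then appends exactly one new entry.
    _ = len(kv_cache)  # attention reads every cached entry
    kv_cache.append(f"kv({new_token})")

cache = prefill(["How", "are", "you", "?"])
for tok in ["I", "am", "fine"]:
    decode_step(cache, tok)

# Cache grows by one entry per token: 4 prompt + 3 generated = 7 entries.
print(len(cache))
```

The asymmetry is the whole story: prefill touches many tokens once, while decode touches the whole cache once per generated token.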

That KV cache is one of the biggest bottlenecks in inference. As conversations grow longer, the cache grows with them, consuming GPU memory that could serve more users. According to the PagedAttention paper, traditional KV cache management wastes substantial memory through fragmentation, but PagedAttention reduces that waste to near zero by borrowing memory management ideas from operating systems.
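A back-of-the-envelope estimate shows why the cache dominates GPU memory. The dimensions below are illustrative assumptions (roughly the scale of a mid-size transformer), not any specific model's published configuration:

```python
# Back-of-envelope KV cache size. All dimensions are illustrative
# assumptions, not any particular model's published configuration.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, dtype_bytes=2, batch=1):
    # 2x for keys and values; dtype_bytes=2 assumes fp16 elements.
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes * batch

# Hypothetical 32-layer model, 32 KV heads of dim 128, one 4096-token context:
size = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 2**30:.1f} GiB per sequence")  # 2.0 GiB, growing linearly
```

Two gibibytes for a single 4K-token conversation, before counting the model weights themselves, is why fragmentation-free cache management like PagedAttention matters.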

How It’s Used in Practice

The most common place you encounter inference is every time you interact with an AI assistant. When you ask Claude to draft an email, write code, or answer a question, the application sends your prompt to a server running the model. The model runs inference to generate a response, and you see the result stream back token by token. That streaming effect isn’t a visual trick — it reflects the actual autoregressive generation happening on the server.

Behind the scenes, production systems run inference for thousands of users at once. Techniques like continuous batching group multiple requests together so the GPU stays busy instead of sitting idle between generations. This is why AI services can handle millions of conversations without needing a separate GPU for each user.
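A toy simulation of continuous batching, with assumed per-request generation lengths. Real schedulers are far more sophisticated, but the core move is the same: when one sequence finishes, its slot is refilled immediately instead of waiting for the whole batch to drain.

```python
from collections import deque

def continuous_batching_steps(remaining_tokens, max_batch=2):
    # remaining_tokens: decode steps each request still needs, in arrival order.
    # Each loop iteration is one decode step across the active batch.
    waiting = deque(remaining_tokens)
    active, steps = [], 0
    while waiting or active:
        # Refill freed slots immediately, mid-flight.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        active = [r - 1 for r in active]        # one decode step for all active
        active = [r for r in active if r > 0]   # finished requests free a slot
        steps += 1
    return steps

# Three requests needing 3, 1, and 2 decode steps, batch size 2:
print(continuous_batching_steps([3, 1, 2]))  # 3 steps
# Static batching would take 5: batch [3, 1] runs 3 steps, then [2] runs 2.
```

The short request finishes after one step and the third request takes its slot right away, so the GPU never idles while requests are pending.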

Pro Tip: When evaluating AI tools, pay attention to Time to First Token (TTFT) — the delay between sending your prompt and seeing the first word appear. According to NVIDIA's technical blog, TTFT includes both scheduling delay and prefill compute. A tool that feels sluggish often has a TTFT problem, not a generation speed problem.
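TTFT is straightforward to measure yourself when an API streams tokens: time from sending the request to receiving the first chunk. Below is a minimal sketch in which a generator simulates a streaming client (the delays are made-up stand-ins for real scheduling, prefill, and decode latency):

```python
import time

def fake_stream(prefill_delay=0.05, n_tokens=5, per_token=0.01):
    # Stand-in for a streaming API client: one pause for scheduling + prefill,
    # then tokens arriving one at a time during decode.
    time.sleep(prefill_delay)
    for i in range(n_tokens):
        yield f"tok{i}"
        time.sleep(per_token)

def measure_ttft(stream):
    start = time.perf_counter()
    first = next(stream)                  # blocks until the first token lands
    ttft = time.perf_counter() - start
    rest = sum(1 for _ in stream)         # drain the remaining tokens
    return ttft, 1 + rest

ttft, total = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms over {total} tokens")
```

The same timing pattern works against any real streaming endpoint: wrap the response iterator and clock the first item separately from the rest.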

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Generating text responses in an AI chat assistant | ✓ | |
| Classifying incoming support tickets automatically | ✓ | |
| Training a model from scratch on new data | | ✓ |
| Real-time translation during a live video call | ✓ | |
| Fine-tuning a model’s weights on domain-specific data | | ✓ |
| Batch-processing documents overnight for summaries | ✓ | |

Common Misconception

Myth: Inference is just “running the model” and there’s nothing to optimize — the hard work is all in training. Reality: For production AI systems, inference is typically the larger ongoing expense. Training happens once (or periodically during fine-tuning), but inference runs every single time a user sends a request. Optimizing inference — through better batching, smarter memory management, and targeted hardware selection — often determines whether an AI product is financially viable at scale.

One Sentence to Remember

Inference is the moment AI goes from learned knowledge to actual output — and for LLMs, that output arrives one token at a time, making every millisecond of the generation loop a target for optimization.

FAQ

Q: What is the difference between training and inference in AI? A: Training teaches the model by adjusting its parameters on large datasets. Inference uses the trained model to generate predictions or outputs from new inputs without changing the model’s weights.

Q: Why does LLM inference generate text one word at a time? A: LLMs use autoregressive decoding, where each token is predicted based on all previous tokens. The model can’t produce the full response at once because each word depends on the words before it.

Q: What makes inference slow for large language models? A: The decode phase is memory-bound, not compute-bound. The model must read the growing KV cache — stored attention history — for each new token, so longer outputs take proportionally longer to generate.

Expert Takes

Inference is where probability meets language. During the decode phase, the model produces a probability distribution across its entire vocabulary for every single token position. The selected token feeds back as input, making each generation step dependent on all preceding choices. This autoregressive loop is why inference speed scales linearly with output length — and why optimizing the attention mechanism’s memory footprint matters more than raw compute for serving large models.

Every time you send a prompt to an LLM, two things happen in sequence: prefill processes your entire input, then decode generates the response token by token. The prefill stage is compute-heavy; decode is memory-heavy. Knowing this split matters because your optimization strategy depends on which phase is the bottleneck. Long prompts with short answers? Optimize prefill. Chatbots generating paragraphs? Focus on decode throughput and KV cache efficiency.

Inference cost is where the real money sits in AI. Training happens once; inference runs every time a user hits “send.” Companies serving AI products at scale spend far more on inference than they ever spent on training. The entire optimization stack — from dedicated hardware to batching strategies — exists because shaving milliseconds off each request translates directly to lower operating costs and faster user experiences.

The speed of inference shapes who gets to participate in AI. When inference costs are high, only well-funded organizations can deploy large models. When those costs drop, access widens. But faster inference also means faster generation of misinformation, deepfakes, and automated manipulation. Every optimization that makes inference cheaper lowers the barrier for both beneficial and harmful uses — and we have no reliable way to selectively slow down only the dangerous ones.