Logits

Also known as: raw scores, unnormalized scores, pre-softmax scores

Logits are the raw numerical scores a language model assigns to every possible next token before those scores are converted into probabilities through the softmax function. They are the starting point for every sampling method, including Top-K and Top-P.

What It Is

If you’ve ever adjusted a temperature slider, picked a sampling method, or wondered why a model chose one word over another, you’ve been dealing with the effects of logits without seeing them directly. Logits are the raw scores a language model assigns to every token in its vocabulary at each generation step, and they are the foundation that every generation control depends on.

Think of logits like raw votes in an election before any counting rules are applied. Each token gets a score — some strongly positive, some near zero, some negative. A token with a high logit is strongly favored; one with a negative logit is an unlikely continuation. But these numbers aren’t probabilities. They don’t add up to one, and they don’t tell you “there’s a 73% chance this word comes next.”

That conversion from raw scores to probabilities happens through the softmax function. Softmax exponentiates each logit and then normalizes the results so they sum to one. After softmax, every token has a proper probability — and only then can sampling methods like Top-K, Top-P, or Min-P decide which tokens stay in the running and which get discarded.
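As a sketch, that conversion takes only a few lines of NumPy; the four logit values here are invented for illustration, not taken from any real model:

```python
import numpy as np

# Hypothetical logits for a tiny 4-token vocabulary
logits = np.array([4.0, 2.0, 0.5, -1.0])

# Softmax: exponentiate each logit, then normalize so the values sum to one.
# Subtracting the max first is a standard trick for numerical stability
# and does not change the result.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

print(probs)        # every entry lies in (0, 1)
print(probs.sum())  # the entries sum to one
```

Note that the ordering never changes: the token with the highest logit always ends up with the highest probability. Softmax only decides how much of the probability mass it gets.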

A modern language model’s vocabulary contains tens of thousands of tokens. At each generation step, the model computes a logit for every single one. Most logits will be low or negative, meaning the model considers those tokens poor continuations. A few will be high, representing the strongest candidates. The spread between the highest logit and the rest reflects how confident the model is about its next choice.

Temperature — one of the most common generation controls — works directly on logits before softmax. A low temperature divides each logit by a small number, amplifying the differences between scores and producing a sharper probability distribution. A high temperature compresses those differences, flattening the distribution so less-favored tokens receive a larger share of probability mass. The logits themselves don’t change; temperature changes how they get interpreted.
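A minimal sketch of that scaling, reusing the same kind of invented four-token logits:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical logits; the values are illustrative only
logits = np.array([4.0, 2.0, 0.5, -1.0])

base  = softmax(logits)         # temperature 1.0: logits used as-is
sharp = softmax(logits / 0.5)   # low temperature: differences amplified
flat  = softmax(logits / 2.0)   # high temperature: differences compressed

# The top token's share of probability mass grows at low temperature
# and shrinks at high temperature; the logits themselves never change.
print(base[0], sharp[0], flat[0])
```

Dividing by a temperature below 1 stretches the gaps between logits before softmax exponentiates them, so the leading token dominates; dividing by a temperature above 1 shrinks those gaps, letting weaker candidates keep more mass.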

How It’s Used in Practice

When you adjust parameters like temperature, Top-K, or Top-P in a chat interface or API call, you’re controlling what happens to logits after the model produces them. The model always generates the same set of raw scores for a given input and internal state. What changes is how those scores get filtered and reshaped before a token is selected.

Some APIs expose logit-related data directly. This is useful for tasks like classification, where instead of generating free text, you check which class-label token received the highest logit. Developers building evaluation tools or custom sampling strategies often work with logit values to understand exactly how confident the model was at each generation step.
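A sketch of that classification pattern, assuming you already have logit scores for a set of label tokens; the labels and values here are hypothetical, not from any particular API:

```python
# Hypothetical: suppose a sentiment-classification prompt produced these
# logits for the two candidate label tokens (invented values)
label_logits = {" positive": 6.2, " negative": 1.8}

# Classification reduces to picking the label whose token scored highest;
# no sampling or softmax is needed when only the argmax matters
predicted = max(label_logits, key=label_logits.get)
print(predicted)
```

Because softmax preserves ordering, comparing raw logits and comparing post-softmax probabilities give the same argmax, so for classification you can skip the normalization entirely.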

Pro Tip: If the model’s output feels too random, lower the temperature to sharpen the logit distribution. If it feels repetitive, raise the temperature to let lower-ranked tokens compete. You’re not changing the logits — you’re changing how aggressively the weaker candidates get suppressed.

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Adjusting generation parameters like temperature or Top-P | ✓ | |
| Building a custom sampling or filtering strategy | ✓ | |
| Using a chat app where parameters are preset and hidden | | ✓ |
| Classifying inputs by checking which label token scores highest | ✓ | |
| Comparing raw outputs across models with different vocabularies | | ✓ |
| Debugging why a model chose an unexpected word | ✓ | |

Common Misconception

Myth: Logits are probabilities — a logit of 10 means the model is 10% confident about that token. Reality: Logits are unbounded raw scores that can be any real number, including negative values. They only become probabilities after softmax normalization. A logit of 10 means the model strongly favors that token relative to others, but it says nothing about percentage confidence until you see the full distribution.

One Sentence to Remember

Logits are the raw votes the model casts before any sampling rule decides who wins — every generation parameter you tweak is really just a different way of counting those votes.

FAQ

Q: What is the difference between logits and probabilities in a language model? A: Logits are raw, unnormalized scores that can be any real number, including negatives. Probabilities come from applying softmax to logits, producing values between zero and one that sum to one.

Q: How does temperature affect logits? A: Temperature divides each logit by a scaling factor before softmax. Lower values sharpen the distribution toward top candidates. Higher values flatten it, letting less-favored tokens compete for selection.

Q: Can I access logits from an API call? A: Many LLM APIs offer a logprobs parameter that returns log-probabilities — values closely related to logits — for top candidate tokens at each step, letting you inspect model confidence.
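That relationship is easy to check numerically: a token's log-probability equals its logit minus the log-sum-exp of all logits. A sketch with invented values:

```python
import numpy as np

# Hypothetical logits for a 4-token vocabulary (illustrative values)
logits = np.array([4.0, 2.0, 0.5, -1.0])

# Log-probabilities are logits shifted by a single constant:
# logprob_i = logit_i - log(sum_j exp(logit_j))
logprobs = logits - np.log(np.exp(logits).sum())

# Exponentiating the logprobs recovers the softmax probabilities
print(np.allclose(np.exp(logprobs).sum(), 1.0))  # True
```

This is why logprobs are "closely related" to logits: they differ only by a per-step constant, so gaps between tokens' logprobs equal the gaps between their logits.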

Expert Takes

Logits are the output of the final linear projection layer in a transformer. They represent the dot product between the hidden state and each token embedding in the vocabulary matrix. The softmax function converts these scores into a categorical probability distribution. Every sampling strategy — greedy, Top-K, nucleus, beam search — operates downstream of this transformation. The logit vector’s dimensionality equals the vocabulary size, which is why this step is a consistent bottleneck in token generation.
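The projection described above can be sketched with toy dimensions; the sizes and random values here are illustrative, not from any real model:

```python
import numpy as np

# Toy dimensions: hidden size 8, vocabulary of 100 tokens
hidden_size, vocab_size = 8, 100
rng = np.random.default_rng(0)

hidden_state = rng.normal(size=hidden_size)                # final hidden state for one position
unembedding  = rng.normal(size=(vocab_size, hidden_size))  # output projection / vocab matrix

# One dot product per vocabulary token yields the logit vector,
# whose length equals the vocabulary size
logits = unembedding @ hidden_state
print(logits.shape)  # (100,)
```

At real scale the vocabulary runs to tens of thousands of tokens, which is why this matrix multiply is the consistent per-step bottleneck the paragraph mentions.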

When you lower the temperature and wonder why the model keeps repeating itself, you’re looking at the symptom, not the cause. The real question is what the logit distribution looked like before your settings touched it. Check the top logprobs in the API response. If the gap between the top-ranked token and its nearest competitor is already large, lowering temperature further just amplifies a decision the model already made. Adjust where adjustment still matters.

Every time a product team debates temperature settings or Top-P values, they’re arguing about what to do with the logits — they just don’t call it that. Understanding logits gives you a shared vocabulary for what’s actually happening when you “tune” a model. Teams that skip this concept end up cargo-culting parameter values from blog posts. The ones that get it ship faster because they debug with data instead of guesswork.

There’s a question buried under the math. When a model produces logits, it encodes not just language patterns but every bias present in its training data. High logits for certain continuations and near-zero logits for others reflect what the model was taught to expect — and by whom. Sampling methods choose from what the logits offer, but they cannot correct for what was never offered in the first place. The distribution was decided long before inference began.