Min P Sampling

Also known as: min-p, minimum probability sampling, min_p

A dynamic token filtering strategy that removes low-probability candidates during text generation by setting a threshold relative to the most likely token’s probability, automatically tightening selection when the model is confident and loosening it when multiple tokens are equally plausible.

What It Is

When a language model generates text, it produces a probability distribution over its entire vocabulary. Raw scores (logits) pass through softmax to become probabilities, and temperature scales how spread out those probabilities are. But after temperature does its work, you still need to decide which tokens to actually sample from. That decision shapes whether your output reads as coherent, creative, or garbled.

Traditional approaches like top-k (keep the top K tokens) or top-p/nucleus sampling (keep the smallest set of tokens whose cumulative probability exceeds a threshold) use fixed cutoffs. The problem: a fixed cutoff doesn’t adapt to context. When the model is highly confident about the next word, you want tight filtering. When many words are equally plausible, you want to keep more candidates open.

Think of it like a velvet-rope policy at a venue that adjusts based on who’s at the front of the line. If the person at the door is a clear VIP (the top token has very high probability), only other VIPs get in. If nobody stands out much, the rope loosens and more guests pass through.

According to Nguyen et al., min-p sampling works by multiplying a fixed parameter (min_p, typically between 0.01 and 0.1) by the probability of the most likely token. Any token whose probability falls below that product gets removed from the candidate pool. The surviving tokens are then renormalized and sampled from. This means the effective threshold automatically rises when the model is confident and drops when the model is uncertain — exactly the behavior you want for balancing coherence with variety.
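The rule described above fits in a few lines. The sketch below is a minimal, framework-free illustration; the function name min_p_filter is ours, not from any library:

```python
def min_p_filter(probs, min_p=0.05):
    """Zero out tokens below min_p * p_max, then renormalize the rest."""
    threshold = min_p * max(probs)                       # dynamic cutoff
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]                     # renormalized survivors

# Confident step: one dominant token -> high cutoff, few survivors.
confident = min_p_filter([0.90, 0.05, 0.03, 0.02], min_p=0.1)  # cutoff 0.09
# Uncertain step: near-flat distribution -> low cutoff, all survive.
uncertain = min_p_filter([0.26, 0.25, 0.25, 0.24], min_p=0.1)  # cutoff 0.026
print(sum(p > 0 for p in confident))  # 1
print(sum(p > 0 for p in uncertain))  # 4
```

Same min_p, same code, different number of survivors: the cutoff rises and falls with the top token's probability.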

This directly complements temperature. Temperature reshapes the probability distribution. Min-p then decides where to draw the filtering line on that reshaped distribution. Together, they give you two independent dials: one for how spread out probabilities are, one for how aggressively low-probability tokens get cut.
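The two dials in sequence might look like the sketch below; this is an illustrative implementation under the stated ordering (temperature before filtering), and the helper name sample_next_token is hypothetical, not a real API:

```python
import math
import random

def sample_next_token(logits, temperature=0.8, min_p=0.05):
    # Dial 1: temperature reshapes the distribution (>1 flattens, <1 sharpens).
    scaled = [l / temperature for l in logits]
    # Softmax turns the scaled logits into probabilities.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Dial 2: min-p draws the filtering line on the reshaped distribution.
    cutoff = min_p * max(probs)
    probs = [p if p >= cutoff else 0.0 for p in probs]
    norm = sum(probs)
    weights = [p / norm for p in probs]
    # Sample a token index from the surviving candidates.
    return random.choices(range(len(logits)), weights=weights, k=1)[0]

# With a strict min_p, only the dominant token survives this distribution.
print(sample_next_token([5.0, 1.0, 0.0], temperature=1.0, min_p=0.5))  # 0
```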

How It’s Used in Practice

Most people encounter min-p sampling when configuring local LLM inference through tools like llama.cpp, Hugging Face Transformers, or vLLM. If you’ve opened a model’s generation settings and spotted a min_p slider alongside temperature and top-p, that’s this method. According to Nguyen et al., min-p has been adopted across major open-source inference frameworks.

In practice, min-p works as a partner to temperature. Temperature controls how flat or peaked the probability distribution is. Min-p then filters out tokens below a dynamic floor. A common workflow: set temperature to your desired creativity level, then set min-p to prevent incoherent tokens from slipping through without killing diversity. Unlike top-p, min-p self-adjusts at every token position, so it handles both confident and uncertain model states without manual tuning.

Pro Tip: Start with a small min_p value (around 0.05) for general-purpose tasks. If outputs feel too random at higher temperatures, nudge it upward. If outputs feel too repetitive, lower it slightly. Because min-p adapts per token, you’ll rarely need to change it once you find a baseline that works for your use case.

When to Use / When Not

Use it for:
- Creative writing with high temperature settings
- Local model inference where you control sampling parameters
- Pairing with temperature to get varied but coherent responses

Avoid it for:
- Deterministic tasks like code generation or structured output
- Commercial APIs that don't expose min-p as a parameter
- Situations where top-p already gives satisfactory results

Common Misconception

Myth: Min-p sampling replaces temperature — you pick one or the other. Reality: Min-p and temperature work at different stages. Temperature reshapes the probability distribution (makes it sharper or flatter). Min-p then filters the reshaped distribution by removing tokens below a dynamic threshold. They’re complementary controls, not alternatives. Using both gives you finer-grained control over generation quality than either one alone.

One Sentence to Remember

Min-p sampling automatically tightens the filter when the model knows what comes next and loosens it when multiple words make equal sense — giving you temperature’s creativity without its chaos.

FAQ

Q: How is min-p sampling different from top-p (nucleus) sampling? A: Top-p uses a fixed cumulative probability threshold. Min-p uses a dynamic threshold that scales with the top token’s probability, so it adapts automatically to the model’s confidence at each generation step.
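The difference shows up in distributions with a long, flat tail. The sketch below counts surviving candidates under each method; both helper functions are illustrative, not taken from any framework:

```python
def top_p_survivors(probs, top_p=0.9):
    """Count the smallest set of tokens whose cumulative probability reaches top_p."""
    kept, cum = 0, 0.0
    for p in sorted(probs, reverse=True):
        kept += 1
        cum += p
        if cum >= top_p:
            break
    return kept

def min_p_survivors(probs, min_p=0.1):
    """Count tokens at or above min_p times the top token's probability."""
    cutoff = min_p * max(probs)
    return sum(1 for p in probs if p >= cutoff)

# One plausible token plus a long tail of 100 barely-plausible ones.
tail_heavy = [0.30] + [0.007] * 100
print(top_p_survivors(tail_heavy))  # 87 -- the fixed mass target drags in the tail
print(min_p_survivors(tail_heavy))  # 1  -- the dynamic cutoff (0.03) excludes it
```

Top-p must keep adding tail tokens until it accumulates 90% of the mass; min-p simply asks each token whether it is within a factor of the leader.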

Q: Can I use min-p sampling with ChatGPT or Claude? A: Not directly through their official APIs at this time. Min-p is primarily available in open-source inference frameworks like llama.cpp, Hugging Face Transformers, and vLLM.

Q: What’s a good default value for min-p? A: According to Wand AI Blog, a starting value of 0.05 works well for most tasks. Increase it for more focused outputs or decrease it when you want wider diversity in responses.

Expert Takes

Min-p sampling addresses a mathematical limitation in static truncation methods. Top-k ignores the shape of the distribution entirely. Top-p respects cumulative mass but applies the same threshold regardless of entropy. Min-p scales its cutoff proportionally to the mode of the distribution, which means it preserves more candidates in high-entropy contexts and fewer in low-entropy ones. The result: sampling behavior that tracks the model’s own uncertainty signal.

If you run local models, add min_p to your generation config alongside temperature. The setup is one parameter — no pipeline changes needed. Where it pays off most: high-temperature creative workflows where top-p alone lets through nonsense tokens. Set a low min_p value, pair it with your existing temperature setting, and test on your actual prompts. The combination handles edge cases that either parameter alone misses.

Min-p matters because it signals how the open-source inference stack is evolving independently from closed API providers. The method spread through llama.cpp and Hugging Face before any major commercial API adopted it. For teams running their own models, that’s a real capability gap — better output quality from a single parameter change. For teams locked into commercial APIs, this is another reason to evaluate what control you’re giving up.

Every sampling method is a choice about which possibilities get silenced. Top-k draws an arbitrary line. Top-p draws a less arbitrary but still fixed line. Min-p draws a line that moves — but who decided the formula for how it moves? The researchers, not the user. More adaptive filtering sounds like progress until you ask: adaptive toward whose definition of coherent? The tokens discarded at high confidence are still valid language. We just decided they don’t belong.