Beam Search

Also known as: beam decoding, beam search decoding, beam search algorithm

Beam Search
A heuristic decoding algorithm that maintains multiple candidate sequences (beams) during text generation, expanding and scoring them at each step to find a high-probability output sequence without exhaustively searching every possibility.

Beam search is a decoding algorithm that explores multiple candidate output sequences simultaneously, selecting the most probable complete sequence rather than choosing one word at a time.

What It Is

When an encoder-decoder model translates a sentence or generates a summary, it produces output one token at a time. At each step, the model assigns probabilities to thousands of possible next words. The simplest approach — greedy search — picks the highest-probability word at every step and moves on. That sounds reasonable, but it often paints the model into a corner. A word that looks great right now can lead to an awkward or low-quality sentence three steps later.
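Greedy search is simple enough to sketch in a few lines. This is a toy illustration, not a real decoder: the "model" here is just a hypothetical lookup table mapping a context to next-token probabilities.

```python
def greedy_decode(model, start, max_len=10, eos="<eos>"):
    """Pick the single most probable next token at every step."""
    seq = [start]
    while len(seq) < max_len and seq[-1] != eos:
        probs = model.get(tuple(seq))
        if not probs:
            break
        # Commit immediately to the locally best token -- there is
        # no way to recover if this choice leads somewhere bad.
        seq.append(max(probs, key=probs.get))
    return seq
```

The key property to notice is the single `max`: once a token is appended, every alternative branch is gone for good.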

Beam search fixes this by keeping multiple options open. Think of it like planning a road trip with a map that only reveals the next intersection as you arrive. Greedy search always takes the road that looks shortest right now. Beam search sends several scouts down different routes and picks the best complete journey at the end.

Here is how it works. You set a beam width (often written as k), which controls how many candidate sequences the algorithm tracks at once. According to Dive into Deep Learning, when k equals 1 beam search reduces to greedy search, and typical values range from 4 to 10 for machine translation tasks. At each decoding step, the algorithm expands every current candidate by considering all possible next tokens, scores the resulting sequences by their cumulative probability, and keeps only the top k. This repeats until each candidate reaches an end-of-sequence token or a maximum length.
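The expand-score-prune loop described above can be sketched directly, again using a hypothetical lookup-table model in place of a real decoder. Scores are cumulative log-probabilities (summing logs is numerically safer than multiplying raw probabilities), and setting `k=1` reproduces greedy search.

```python
import math

def beam_search(model, start, k=3, max_len=5, eos="<eos>"):
    # Each beam is a (cumulative log-prob, sequence) pair.
    beams = [(0.0, [start])]
    for _ in range(max_len - 1):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == eos:
                candidates.append((logp, seq))  # finished beams carry over
                continue
            probs = model.get(tuple(seq), {})
            if not probs:
                candidates.append((logp, seq))
                continue
            # Expand this beam with every possible next token.
            for token, p in probs.items():
                candidates.append((logp + math.log(p), seq + [token]))
        # Prune: keep only the top k sequences by cumulative score.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
        if all(seq[-1] == eos for _, seq in beams):
            break
    return beams[0][1]
```

With a toy model where the best first token leads to mediocre continuations, `k=2` recovers a better full sequence than `k=1` (greedy) does.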

One important detail: longer sequences naturally have lower cumulative probabilities because you multiply probabilities together at each step. To prevent the algorithm from always favoring short outputs, most implementations apply length normalization — dividing the total score by a function of the sequence length. This keeps comparisons fair between a concise three-word answer and a detailed twelve-word answer.
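A minimal sketch of the idea, using the simplest normalization scheme: dividing the cumulative log-probability by the sequence length raised to an exponent `alpha` (a tunable knob; real systems use several variants of this penalty, and the specific form here is illustrative).

```python
def normalized_score(log_prob, length, alpha=1.0):
    # alpha = 0 disables normalization; alpha = 1 gives the
    # average log-probability per token.
    return log_prob / (length ** alpha)

# A concise 3-token answer with total log-prob -3.0 beats a
# 12-token answer with total log-prob -8.0 on raw score...
raw_short, raw_long = -3.0, -8.0
# ...but after normalization the longer answer wins, because its
# per-token probability is actually higher.
norm_short = normalized_score(raw_short, 3)   # -1.0
norm_long = normalized_score(raw_long, 12)    # about -0.67
```

Without the penalty, the short sequence would always win simply because it had fewer probabilities multiplied in.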

The result is not guaranteed to be the single best possible sequence — finding that would require checking every combination, which is computationally infeasible because the number of candidate sequences grows exponentially with output length. Beam search is a practical trade-off: it explores enough of the search space to consistently produce better output than greedy decoding, without the exponential cost of an exhaustive search.

How It’s Used in Practice

Beam search is the standard decoding strategy for encoder-decoder models handling structured output tasks. If you have used a machine translation service or an automatic summarization tool, beam search was almost certainly running behind the scenes. According to Hugging Face Docs, it remains standard for encoder-decoder translation and summarization models, while stochastic sampling methods are preferred for open-ended text generation.

The distinction matters. Translation and summarization have a relatively narrow range of correct outputs for a given input — you want the most accurate one. Open-ended tasks like creative writing or chatbot responses benefit from variety, so methods like nucleus sampling introduce controlled randomness instead. This is why beam search is hard-coded in many translation pipelines but rarely used in general-purpose chatbots.

Pro Tip: If your beam search output sounds repetitive or generic, try increasing the beam width slightly and adding a length penalty. A width of 5 with a modest length penalty often hits a sweet spot between quality and speed for translation tasks. Going above 10 beams rarely improves output and slows inference noticeably.

When to Use / When Not

| Scenario | Use | Avoid |
| --- | --- | --- |
| Machine translation where accuracy matters most | ✓ | |
| Creative writing or storytelling tasks | | ✓ |
| Summarization with a target length | ✓ | |
| Chatbot responses that need variety and personality | | ✓ |
| Structured data-to-text generation (reports, captions) | ✓ | |
| Latency-critical applications with tight compute budgets | | ✓ |

Common Misconception

Myth: Beam search finds the single best possible output sequence. Reality: Beam search is an approximation. It tracks only k candidates at a time, so it can miss better sequences that pass through low-probability intermediate steps. It consistently outperforms greedy search, but it does not guarantee the globally optimal answer.
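A short self-contained sketch makes the myth concrete. With made-up probabilities, a beam of width 2 prunes the prefix "c" at the first step — even though the full path through "c" turns out to have the highest total probability.

```python
# Made-up one-step and continuation probabilities for three prefixes.
first_step = {"a": 0.40, "b": 0.35, "c": 0.25}
best_continuation = {"a": 0.50, "b": 0.50, "c": 1.00}

# Beam search with k=2 keeps the two best prefixes: "a" and "b".
kept = sorted(first_step, key=first_step.get, reverse=True)[:2]

# Full path probabilities reveal "c" was the global optimum
# (0.25 * 1.00 = 0.25 beats 0.40 * 0.50 = 0.20).
totals = {t: first_step[t] * best_continuation[t] for t in first_step}
best_overall = max(totals, key=totals.get)  # "c" -- already pruned
```

The pruning step never sees the continuation probabilities, so the globally best sequence is unreachable once its prefix is dropped.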

One Sentence to Remember

Beam search keeps multiple candidate sequences alive during generation and picks the winner at the end — think of it as “don’t commit to one path until you’ve seen where several roads lead,” which is exactly why encoder-decoder models rely on it for tasks where accuracy matters more than spontaneity.

FAQ

Q: What is the difference between beam search and greedy search? A: Greedy search picks the single highest-probability token at each step. Beam search tracks multiple candidates simultaneously and selects the best complete sequence, avoiding early mistakes that greedy search cannot recover from.

Q: Why don’t large language models use beam search for chat? A: Open-ended generation benefits from variety. Beam search tends to produce safe, repetitive text. Sampling methods like nucleus sampling introduce controlled randomness, making responses more natural and diverse for conversational use.

Q: How do I choose the right beam width? A: Start with 4 or 5 for most translation and summarization tasks. Wider beams improve quality up to a point but increase compute cost linearly. Values above 10 rarely help and can even degrade quality by favoring overly generic sequences.

Expert Takes

Beam search operates as a breadth-limited tree search over token sequences, where the branching factor equals the vocabulary size. The algorithm trades completeness for tractability — pruning all but k paths at each depth level. For conditional generation tasks with peaked output distributions, this pruning costs little in practice. For flat distributions, it systematically misses high-quality sequences hiding behind low-probability prefixes. The gap between beam search and exact inference remains an active research question.

If you are building a pipeline that calls an encoder-decoder model for translation or summarization, beam search is probably already the default in your framework — check before adding custom decoding logic. The parameter that matters most is beam width, and the right value depends on your latency budget. Run a quick benchmark: try different widths, measure the output quality difference, and decide if the extra inference time is worth it for your specific use case.

Beam search tells you something about where AI generation is headed commercially. Tasks that need deterministic, high-accuracy output — legal translation, medical summarization, structured reporting — still depend on it. Consumer-facing products moved toward sampling because users value personality over precision. Teams building B2B AI products should understand this split: your decoding strategy shapes your product experience as much as your model choice does.

The choice between beam search and sampling is not purely technical — it reflects a value judgment about what “good” output means. Beam search optimizes for the most probable sequence, which effectively means the most average, most expected answer. That works for translation where fidelity matters. But in contexts where AI generates advice or analysis, always choosing the most probable path can reinforce dominant viewpoints and flatten minority perspectives. The decoding method quietly shapes what gets said.