Few-Shot Learning

Also known as: few-shot prompting, in-context learning, ICL

Few-Shot Learning
A prompting technique where a small number of input-output examples are included in the prompt to guide an LLM’s response, allowing the model to recognize task patterns without any retraining or weight updates.

Few-shot learning is a prompting technique where you provide a language model with a handful of input-output examples so it understands the task pattern before answering a new question.

What It Is

When you ask a language model to classify a customer complaint or format data in a specific way, the model has no memory of how you want it done. Few-shot learning solves this by placing a small number of worked examples — typically one to five — directly into the prompt before your actual question. The model reads these examples, picks up the pattern, and applies it to the new input. Instead of writing detailed instructions about every edge case, you show the model what “right” looks like.

Think of it like showing a new colleague three completed expense reports before handing them a blank form. They don’t need a training course — they just follow the pattern they’ve seen.
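A minimal sketch of what that pattern-following looks like as a prompt string. The date-normalization task and the example pairs are invented for illustration; any LLM client could receive the resulting string as its prompt.

```python
# Three worked examples, then the new input with the answer left blank
# for the model to complete. Task and pairs are hypothetical.
examples = [
    ("March 5, 2024", "2024-03-05"),
    ("7/4/1776", "1776-07-04"),
    ("Dec 25 1999", "1999-12-25"),
]

def build_few_shot_prompt(examples, new_input):
    """Format worked input-output pairs, then the unanswered new input."""
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    blocks.append(f"Input: {new_input}\nOutput:")
    return "\n\n".join(blocks)

print(build_few_shot_prompt(examples, "June 1, 2023"))
```

Nothing here is model-specific: the entire technique is just careful string assembly, which is why it works with any assistant or API.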

What makes few-shot learning distinct from traditional machine learning is that no retraining happens. According to the Prompting Guide, the model learns from context alone, with no parameter changes. The examples live entirely in the prompt text. The model’s weights — its stored knowledge — stay exactly the same. This is why the technique is also called in-context learning, or ICL: the “learning” happens within the conversation window, not through any update to the model itself.

Few-shot learning sits on a spectrum of prompting strategies. Zero-shot means no examples at all — you describe the task and rely on the model’s pre-trained knowledge to figure out the format. One-shot uses a single example. Few-shot typically means two to five examples. The more examples you provide, the more context tokens (the units of text a model can process in a single prompt) you consume, so there’s a practical tradeoff between accuracy and prompt length. Choosing the right number often comes down to how ambiguous the task is — straightforward classification needs fewer examples than nuanced tone matching.
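The spectrum can be made concrete with a single prompt builder parameterized by shot count. The sentiment task and labeled pairs are invented, and the whitespace word count is only a rough stand-in for real tokenization, shown to illustrate the accuracy-versus-length tradeoff.

```python
# Hypothetical labeled pairs for a sentiment task.
pairs = [
    ("great service, will return", "positive"),
    ("waited an hour, food was cold", "negative"),
    ("it was fine, nothing special", "neutral"),
]

def prompt_with_shots(n_shots, query):
    """Build the same prompt with 0 (zero-shot), 1 (one-shot),
    or more (few-shot) worked examples."""
    header = "Classify the sentiment of each review.\n\n"
    shots = "".join(f"Review: {r}\nSentiment: {s}\n\n"
                    for r, s in pairs[:n_shots])
    return header + shots + f"Review: {query}\nSentiment:"

for n in (0, 1, 3):  # zero-shot, one-shot, few-shot
    p = prompt_with_shots(n, "best meal I've had all year")
    # Crude length proxy; real token counts depend on the tokenizer.
    print(f"{n}-shot prompt: ~{len(p.split())} whitespace-separated words")
```

Each added example makes the expected pattern clearer while consuming more of the context window, which is exactly the tradeoff described above.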

This technique became central to how we evaluate large language models. Benchmarks like MMLU use a standardized number of examples — according to DeepEval Docs, MMLU’s standard evaluation uses five examples per subject — to create a consistent testing condition across all models. Without few-shot prompting, benchmark scores would depend too heavily on how each model interprets instructions on its own, making fair comparison nearly impossible.

How It’s Used in Practice

The most common place you’ll encounter few-shot learning is in everyday prompt engineering. When you paste two or three examples of the output format you want into ChatGPT, Claude, or any other AI assistant, you’re doing few-shot prompting. Product managers use it to get consistent classification results — paste three customer messages labeled as bug, feature request, or question, then ask the model to classify a new one. Developers embed few-shot examples in automated pipelines to enforce output structure without the cost and complexity of fine-tuning a custom model.
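In a pipeline, that triage pattern might be sketched as below. The labeled messages are invented, and `call_llm` is a placeholder for whatever client you actually use (OpenAI, Anthropic, a local model); a stub stands in here so the sketch runs end to end.

```python
# Hypothetical labeled messages defining the classification pattern.
LABELED = [
    ("The app crashes when I upload a photo", "bug"),
    ("Could you add dark mode?", "feature request"),
    ("How do I reset my password?", "question"),
]
LABELS = {"bug", "feature request", "question"}

def classify(message, call_llm):
    """Embed few-shot examples in the prompt, then validate the reply."""
    shots = "\n".join(f"Message: {m}\nLabel: {l}" for m, l in LABELED)
    prompt = (
        "Label each message as bug, feature request, or question.\n\n"
        f"{shots}\nMessage: {message}\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    # Guard against free-form output: fall back if the label is unknown.
    return answer if answer in LABELS else "question"

# Stubbed model call for demonstration; swap in a real client in practice.
print(classify("The export button does nothing", lambda p: " bug "))  # prints "bug"
```

The validation step matters in automation: even with examples, a model can occasionally drift from the expected labels, so the pipeline checks the output against the allowed set.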

In model evaluation, few-shot learning is the standard testing condition. When a research team reports how a model scores on MMLU, that score almost always comes from five-shot evaluation. This matters because a model’s zero-shot performance can differ significantly from its few-shot performance. Comparing a zero-shot score from one model against a five-shot score from another would produce misleading rankings, which is why evaluation protocols specify the exact number of examples used.
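A sketch of why fixing the shot count makes evaluation repeatable: every model under test receives the identical five worked examples before each question. The items below are invented arithmetic stand-ins, not real MMLU questions, and the formatting is an assumed multiple-choice layout, not MMLU's exact template.

```python
# Five hypothetical dev-set items: (question, choices, answer letter).
FIVE_SHOT = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], "B"),
    ("3 * 3 = ?", ["6", "8", "9", "12"], "C"),
    ("10 - 7 = ?", ["2", "3", "4", "5"], "B"),
    ("8 / 2 = ?", ["2", "3", "4", "6"], "C"),
    ("5 + 4 = ?", ["8", "9", "10", "11"], "B"),
]

def format_item(question, choices, answer=None):
    """Render one multiple-choice item; leave the answer blank if None."""
    lines = [question] + [f"{l}. {c}" for l, c in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def five_shot_prompt(dev_examples, test_q, test_choices):
    """Prefix the test question with exactly five answered examples."""
    assert len(dev_examples) == 5, "five-shot evaluation fixes the count"
    blocks = [format_item(*item) for item in dev_examples]
    blocks.append(format_item(test_q, test_choices))
    return "\n\n".join(blocks)

prompt = five_shot_prompt(FIVE_SHOT, "6 + 1 = ?", ["5", "6", "7", "8"])
print(prompt.splitlines()[-1])  # the blank "Answer:" line the model completes
```

Because the five examples and their order are held constant for every model, any score difference reflects the model, not the prompt.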

Pro Tip: Start with two or three diverse examples that cover edge cases, not just easy inputs. If you only show the model straightforward examples, it won’t know how to handle ambiguous ones — and those are where few-shot prompting pays off most.

When to Use / When Not

Use when:
- Enforcing a consistent output format across multiple prompts
- Classifying items into categories you define
- Standardizing evaluation conditions for benchmark testing

Avoid when:
- Simple factual questions the model already handles well
- The task requires deep domain expertise not present in training data
- The prompt is already near the context window limit

Common Misconception

Myth: Few-shot learning teaches the model something new permanently — after seeing your examples, it “knows” the pattern for future conversations.

Reality: The model retains nothing between sessions. Every conversation starts from zero. Your examples only influence the current prompt window. Close the chat, and the pattern is gone. This is fundamentally different from fine-tuning, where the model’s weights are actually updated with new training data.

One Sentence to Remember

Few-shot learning is showing, not telling — a handful of examples in your prompt teaches the pattern for that conversation only, and benchmarks like MMLU depend on this technique to make model comparisons fair and repeatable.

FAQ

Q: How many examples should I include in a few-shot prompt? A: Two to five examples usually work well. More examples improve consistency but consume context tokens, so balance accuracy against prompt length for your specific task.

Q: What is the difference between few-shot learning and fine-tuning? A: Few-shot learning puts examples in the prompt with no model changes. Fine-tuning permanently updates the model’s weights using a training dataset, requiring compute resources and technical setup.

Q: Why does MMLU use few-shot prompting instead of zero-shot? A: Few-shot prompting reduces format ambiguity. Providing examples ensures the model understands the expected answer structure, making scores more comparable across different models.

Expert Takes

Few-shot learning is pattern recognition applied to the input sequence, not genuine learning. The model performs conditional text generation where each example narrows the probability distribution for the next token. More examples reduce output variance, which is precisely why benchmarks like MMLU standardize the example count across all subjects. The mechanism is statistical inference within a fixed context window, not knowledge acquisition.

In any context-driven workflow, few-shot examples function as a specification layer. They define the expected input-output contract more precisely than instructions alone. When building automated evaluation pipelines, fixing the number of examples across all test runs is what makes results reproducible. Treat your examples as part of your spec, version them, and review them the same way you review code.

Few-shot prompting is the first technique most teams reach for, and the last one many bother to optimize. The gap between a casually written prompt and a well-structured few-shot prompt can mean the difference between a demo that impresses and a product that actually ships. Teams that treat prompt construction as engineering discipline rather than afterthought pull ahead fast.

The examples you choose shape the answers you get. Few-shot prompting encodes your assumptions about what a correct answer looks like. When evaluation benchmarks pick specific examples, they are deciding what counts as knowledge. That selection process often goes unexamined. A model scored on carefully curated examples might perform very differently on messy, real-world questions that don’t fit the expected template.