Pool-Based Sampling

Also known as: pool-based active learning, pool sampling, pool-based query strategy

Pool-based sampling is an active learning approach where the algorithm has access to a large collection of unlabeled examples at once, evaluates their informativeness using a query strategy, and selects the most valuable ones for a human to label.

Pool-based sampling is the most common active learning setup: the model scores every example in a large pool of unlabeled data and sends only the most informative ones to a human for labeling.

What It Is

Labeling data is the expensive part of building a machine learning model. You often sit on a mountain of raw, unlabeled examples — support tickets, product images, transcripts — but a human has to read each one and assign a label, and that time costs real money. Pool-based sampling decides which examples are worth that money. Instead of labeling everything or labeling at random (which wastes effort on easy, obvious cases), the model looks at the whole pool and points to the handful of examples where a label would teach it the most.

Think of a student with a giant stack of practice questions and only one evening to study. A smart student flips through the whole stack first, skips the questions they already know cold, and spends their limited time on the ones that confuse them. Pool-based sampling is the model behaving that way: it surveys the entire pool before asking for help.

The mechanics are a loop. The model trains on whatever small labeled set exists so far, then runs predictions across the unlabeled pool and assigns each example an informativeness score using a query strategy — a rule for measuring how useful a label would be. The highest-scoring examples go to a human (or oracle) for labeling, the new labels join the training set, the model retrains, and the cycle repeats. The “pool” in the name means the candidates sit in a static collection the model can examine in full at any time, rather than streaming past one by one.
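The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the toy one-dimensional classifier (`train`, `predict_proba`), the `oracle` function standing in for the human annotator, and the `rounds` and `batch_size` parameters are all hypothetical names chosen for the example.

```python
import math

def train(labeled):
    """Toy 1-D model (assumption): the midpoint between the two class means."""
    xs0 = [x for x, y in labeled if y == 0]
    xs1 = [x for x, y in labeled if y == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2

def predict_proba(threshold, x):
    """Probability of class 1: a logistic curve centered on the threshold."""
    return 1 / (1 + math.exp(-(x - threshold)))

def least_confidence(p1):
    """Query strategy: 1 minus the probability of the predicted class."""
    return 1 - max(p1, 1 - p1)

def oracle(x):
    """Hypothetical stand-in for the human annotator."""
    return 1 if x > 3.0 else 0

def active_learning_loop(labeled, pool, score=least_confidence,
                         rounds=3, batch_size=2):
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)                       # 1. train on current labels
        ranked = sorted(                             # 2. score the whole pool
            range(len(pool)),
            key=lambda i: score(predict_proba(model, pool[i])),
            reverse=True,
        )
        chosen = ranked[:batch_size]                 # 3. take the top scorers
        labeled += [(pool[i], oracle(pool[i])) for i in chosen]  # 4. ask the oracle
        picked = set(chosen)
        pool = [x for i, x in enumerate(pool) if i not in picked]  # 5. shrink the pool
    return train(labeled), labeled, pool             # final retrain on all labels

# Usage: a seed set with one example per class, and nine unlabeled points.
seed = [(0.0, 0), (10.0, 1)]
unlabeled = [float(x) for x in range(1, 10)]
threshold, labeled, remaining = active_learning_loop(seed, unlabeled)
```

The `score` argument is passed in rather than hard-coded, so a different query strategy can replace `least_confidence` without touching the loop itself.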

That query strategy is where pool-based sampling connects to uncertainty sampling. Uncertainty sampling is one way to score the pool: rank each example by how unsure the model is about it, using measures like entropy, margin, or least confidence, then pick the most uncertain ones. Pool-based sampling is the surrounding setting; uncertainty sampling is a strategy you plug into it. You can swap in others — such as query-by-committee, where several models vote and you pick the examples they disagree on most — without changing the loop around them.
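To make the scoring rules named above concrete, here is a hedged sketch of the three uncertainty measures plus one common disagreement measure for query-by-committee. The function names are illustrative, each function takes a list of class probabilities for a single example, and vote entropy is one of several ways to quantify committee disagreement, chosen here as an assumption.

```python
import math

def least_confidence(probs):
    """Higher means more uncertain: 1 minus the top class probability."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes. SMALLER means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats. Higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def vote_entropy(votes):
    """Committee disagreement: entropy of the label votes cast by the members."""
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

confident = [0.90, 0.05, 0.05]   # model strongly prefers one class
unsure = [0.40, 0.35, 0.25]      # model is torn between classes
```

Note the direction of each score: least confidence and entropy rank the most uncertain examples highest, while margin ranks them lowest, so a margin-based strategy selects the smallest values.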

How It’s Used in Practice

The everyday scenario is a team building a classifier on a tight labeling budget, for example sorting customer messages into categories or flagging defective parts from photos. They have tens of thousands of unlabeled records and budget to label only a small fraction. Rather than hand annotators a random sample, they wrap the job in a pool-based active learning loop: train on a small seed set, score the full pool, send annotators the top-ranked examples, retrain, repeat. Each round, annotator effort goes to the examples the current model is expected to learn the most from.

Many data-labeling platforms support this workflow. You connect your unlabeled pool, pick a query strategy, and the tool surfaces the next batch to label; some also report when extra labels stop improving the model. The same idea shows up when curating examples for fine-tuning or evaluation sets: score a big pool of candidates, then label the most informative ones instead of all of them.

Pro Tip: Don’t re-score the entire pool from scratch every round if it’s huge, because the cost of each round grows with the pool size. Score a large random subset each iteration instead. You usually keep most of the benefit of pool-based selection while cutting the compute per round, which matters once your pool runs into the millions.
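A minimal sketch of that subsampling idea, assuming a `score` function that maps one example to an informativeness value. The names `select_batch`, `subset_size`, and `rng` are hypothetical and chosen for illustration.

```python
import random

def select_batch(pool, score, batch_size, subset_size, rng=random):
    """Score a random subset of the pool and return the indices of its top scorers."""
    # Draw candidate indices without replacement; never more than the pool holds.
    candidates = rng.sample(range(len(pool)), min(subset_size, len(pool)))
    # Only the sampled candidates are scored, not the full pool.
    candidates.sort(key=lambda i: score(pool[i]), reverse=True)
    return candidates[:batch_size]

# Usage: score 100 of 1,000 examples per round instead of all of them.
pool = list(range(1000))
picked = select_batch(pool, score=lambda x: -abs(x - 500),
                      batch_size=5, subset_size=100, rng=random.Random(0))
```

Drawing a fresh subset each round means every example still has a chance of being scored over the course of the loop.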

When to Use / When Not

Scenario | Use | Avoid
Large unlabeled pool, small labeling budget | ✓ |
Examples stream in real time and must be decided on instantly | | ✓
Annotation is the main cost and you want to minimize it | ✓ |
You already have abundant cheap labels for everything | | ✓
Model can be retrained between labeling rounds | ✓ |
The pool is so small you can just label all of it | | ✓

Common Misconception

Myth: Pool-based sampling is itself a method for choosing which examples to label, so it competes with uncertainty sampling.

Reality: Pool-based sampling is the setting, not the selection rule. It describes the fact that all candidate examples sit in a fixed pool the model can inspect in full. The choice of which examples to pick comes from a separate query strategy — uncertainty sampling, query-by-committee, and others — that runs inside the loop. They work together, not against each other.

One Sentence to Remember

Pool-based sampling is the “look at everything, then ask about the hard parts” setup for active learning — get the loop in place first, then experiment with query strategies like uncertainty sampling to choose which parts to ask about.

FAQ

Q: What is the difference between pool-based and stream-based sampling? A: Pool-based sampling examines a full, static collection of unlabeled examples and picks the best ones to label. Stream-based sampling sees examples one at a time and must decide to label or discard each on arrival.
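The contrast can be sketched as two small functions. This is an illustration under assumptions: `score` is any informativeness function, and the fixed `threshold` is one simple way a stream-based learner might make its per-example decision, not the only one.

```python
def pool_select(pool, score, k):
    """Pool-based: look at every candidate, then keep the k highest-scoring ones."""
    return sorted(pool, key=score, reverse=True)[:k]

def stream_select(stream, score, threshold):
    """Stream-based: decide on each example as it arrives, with no look-ahead."""
    for x in stream:
        if score(x) >= threshold:
            yield x

# Hypothetical informativeness scores for four examples.
scores = {"a": 0.2, "b": 0.9, "c": 0.6, "d": 0.7}
top_two = pool_select(list(scores), scores.get, k=2)
accepted = list(stream_select(iter(scores), scores.get, threshold=0.6))
```

The pool-based version can guarantee exactly `k` picks because it compares candidates against each other; the stream-based version cannot know in advance how many examples will pass its threshold.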

Q: Is pool-based sampling the same as uncertainty sampling? A: No. Pool-based sampling is the overall setting where examples sit in a fixed pool. Uncertainty sampling is one query strategy that runs inside it, ranking examples by how uncertain the model is about them.

Q: When does pool-based sampling not help? A: When labels are cheap and abundant, when examples must be judged in real time as they arrive, or when the pool is small enough to label entirely. With cheap labels or a tiny pool, the selection overhead outweighs the savings; with real-time arrival, there is no fixed pool to rank, so a stream-based approach fits better.

Expert Takes

Not a selection rule. A setting. Pool-based sampling only states that every candidate example is available for inspection at once. What it buys you is a global view: the model can rank the full pool before committing scarce labeling effort. The intelligence lives in the query strategy you run on top — the pool is simply the stage on which uncertainty or disagreement measures get to act.

Treat the pool-based loop as your labeling pipeline’s contract: train, score the pool, request labels, retrain, repeat. Specify the query strategy as a swappable component, not a hard-coded step. That way you can start with uncertainty sampling and switch to committee disagreement later without rebuilding the loop. The structure stays fixed; the scoring rule plugs in where your spec says it does.

Labeling budgets are where machine learning projects quietly bleed money. Pool-based sampling turns annotation spend into a targeted investment instead of a flat tax on every example. Teams that wrap their labeling in this loop can often reach a usable model with fewer labels than random sampling would need. In a market where data is the moat, spending labels where they count is a competitive edge.

A pool-based loop concentrates human attention on whatever the model finds confusing — which sounds efficient until you ask what gets ignored. Examples the model is wrongly confident about never surface for review, so blind spots harden quietly. Who decides the query strategy, and who audits what the pool keeps skipping? Efficiency in labeling is not the same as fairness in what finally gets learned.