Bradley-Terry Model
Also known as: BT Model, Bradley-Terry, BTL Model
- Bradley-Terry Model
- A probabilistic framework that converts pairwise preference comparisons into numerical strength scores. Originally from statistics (1952), it now serves as the standard mathematical loss function for training reward models in reinforcement learning from human feedback.
The Bradley-Terry model is a statistical framework that converts pairwise preference comparisons into numerical strength scores, forming the mathematical foundation for training reward models used in LLM alignment.
What It Is
When you ask an AI assistant to rewrite an email and it produces two versions, someone has to decide which version is “better.” But how do you turn thousands of those individual “I prefer A over B” judgments into a single scoring system that a machine can learn from? That is exactly the problem the Bradley-Terry model solves, and it sits at the heart of how modern language models learn to follow human preferences.
The Bradley-Terry model, originally published in 1952 by Ralph Bradley and Milton Terry, is a probabilistic model that estimates the relative strength of items based only on pairwise comparisons. Think of it like a chess rating system: you don’t need every player to compete against every other player. From a subset of head-to-head matches, the model infers a latent “strength” score for each player. The key formula is straightforward: the probability that item i beats item j equals the sigmoid of the difference between their scores. The sigmoid function maps any number to a probability between 0 and 1, so a large positive difference means a near-certain win.
According to Wikipedia, only the difference between scores matters — adding the same constant to every score changes nothing about the predicted outcomes. This property makes the model especially useful for ranking where absolute values are meaningless but relative ordering is everything.
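The win-probability formula and the shift-invariance property can be sketched in a few lines of Python (a minimal illustration; `bt_win_prob` is our name, not a library function):

```python
import math

def bt_win_prob(score_i: float, score_j: float) -> float:
    """Bradley-Terry probability that item i beats item j:
    the sigmoid of the score difference."""
    return 1.0 / (1.0 + math.exp(-(score_i - score_j)))

# A large positive gap means a near-certain win.
p = bt_win_prob(3.0, 0.0)  # ≈ 0.95

# Shift invariance: adding the same constant to every score
# leaves every predicted probability unchanged.
shifted = bt_win_prob(3.0 + 100.0, 0.0 + 100.0)
assert abs(p - shifted) < 1e-12
```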
In the context of reward model architecture and LLM alignment, the Bradley-Terry model provides the loss function that trains a reward model. Human annotators compare pairs of model outputs and label which response they prefer. The reward model then learns to assign scores to responses such that the preferred response reliably gets a higher score, exactly as the Bradley-Terry formula predicts. According to the RLHF Book, this same mathematical assumption is also implicit in Direct Preference Optimization (DPO), a popular alternative to traditional RLHF that skips the explicit reward model step entirely.
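The reward-model loss derived from this formula is the negative log probability that the preferred response scores higher. A minimal sketch in plain Python (real training code operates on batches of reward-model outputs in a tensor framework; `bt_pairwise_loss` is an illustrative name):

```python
import math

def bt_pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss for one preference pair:
    -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(x)) = log(1 + exp(-x)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss shrinks as the chosen response's reward pulls ahead:
loss_good = bt_pairwise_loss(2.0, 0.0)   # ≈ 0.13
loss_tied = bt_pairwise_loss(0.0, 0.0)   # = log 2 ≈ 0.69
loss_bad = bt_pairwise_loss(-2.0, 0.0)   # ≈ 2.13
```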
According to Sun et al., a key assumption behind the Bradley-Terry framework is the “independence from irrelevant alternatives” — the probability that you prefer A over B doesn’t change based on whether option C exists. This is a useful simplification, but it doesn’t always hold for language outputs where context and framing shift preferences.
How It’s Used in Practice
The most common place you encounter Bradley-Terry scoring is inside the RLHF pipeline that fine-tunes large language models. Companies training AI assistants collect thousands of pairwise comparisons: human raters see two responses to the same prompt and pick the one that is more helpful, more accurate, or less harmful. These binary preferences feed directly into a reward model whose training loss is derived from the Bradley-Terry formula.
You also see Bradley-Terry scores powering public model leaderboards and evaluation platforms. When users vote on which AI response they prefer in a blind comparison, the platform uses Bradley-Terry scoring (or Elo-style ratings, a sequential variant of the same idea) to rank models by inferred quality — the same math that ranks chess players.
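A leaderboard-style fit can be sketched as maximum-likelihood gradient ascent over a vote log (illustrative code with made-up votes; production platforms use more robust solvers):

```python
import math

def fit_bradley_terry(wins, n_items, lr=0.1, steps=2000):
    """Fit latent Bradley-Terry scores by gradient ascent on the
    log-likelihood. `wins` is a list of (winner, loser) index pairs."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for w, l in wins:
            # Model's current probability that the observed winner wins
            p = 1.0 / (1.0 + math.exp(-(scores[w] - scores[l])))
            grad[w] += 1.0 - p
            grad[l] -= 1.0 - p
        scores = [s + lr * g for s, g in zip(scores, grad)]
        mean = sum(scores) / n_items          # pin the mean at zero:
        scores = [s - mean for s in scores]   # only differences matter
    return scores

# Hypothetical blind-vote log among models A (0), B (1), C (2):
# A beats B 3-1, B beats C 3-1, A beats C 3-1.
votes = ([(0, 1)] * 3 + [(1, 0)] +
         [(1, 2)] * 3 + [(2, 1)] +
         [(0, 2)] * 3 + [(2, 0)])
scores = fit_bradley_terry(votes, n_items=3)
assert scores[0] > scores[1] > scores[2]  # inferred ranking: A > B > C
```

Note the mean-centering step: it is exactly the shift invariance described above, resolved by convention so the solver returns one canonical set of scores.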
Pro Tip: If you’re evaluating AI models for your team, look for benchmarks that use pairwise preference scoring rather than single-score ratings. Pairwise comparisons reduce the “everyone gets a 4 out of 5” problem and produce more reliable rankings.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Training a reward model from human preference labels | ✅ | |
| Ranking items where only pairwise comparisons are available | ✅ | |
| Building an evaluation benchmark with blind A/B voting | ✅ | |
| Tasks where preferences depend heavily on which other options are present | | ❌ |
| Single-item scoring where absolute quality matters (e.g., safety classification) | | ❌ |
| Small datasets with fewer than a few hundred comparison pairs | | ❌ |
Common Misconception
Myth: The Bradley-Terry model tells you how “good” a response is on an absolute scale. Reality: It only measures how items compare relative to each other. A score of 3.2 means nothing by itself — it only becomes meaningful when compared against another item’s score. This is why reward models trained with Bradley-Terry can assign high scores to mediocre responses if all alternatives in the training set were worse.
One Sentence to Remember
The Bradley-Terry model is the math that turns “I prefer A over B” into a score your reward model can learn from — relative rankings, not absolute quality, and that distinction matters every time you evaluate or fine-tune an LLM.
FAQ
Q: How is the Bradley-Terry model related to Elo ratings? A: Elo ratings are a sequential application of the same core idea — predicting win probability from score differences. Elo updates after each match, while Bradley-Terry fits all comparisons at once for a global ranking.
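The sequential Elo step can be sketched as follows (standard formula; `k=32` and `scale=400` are conventional defaults, not universal constants):

```python
def elo_update(r_winner: float, r_loser: float,
               k: float = 32.0, scale: float = 400.0):
    """One sequential Elo step: compute the winner's expected score from
    the logistic curve, then nudge both ratings toward the observed result."""
    expected = 1.0 / (1.0 + 10.0 ** ((r_loser - r_winner) / scale))
    delta = k * (1.0 - expected)
    return r_winner + delta, r_loser - delta

# Two equally rated players: the winner gains k/2 = 16 points.
new_w, new_l = elo_update(1500.0, 1500.0)  # → (1516.0, 1484.0)
```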
Q: Does the Bradley-Terry model work for comparisons with ties? A: The basic version only handles strict preferences. Extensions like the Davidson model add a tie parameter, but most RLHF implementations force annotators to choose a winner or discard tied pairs.
Q: Can Bradley-Terry handle more than two options at once? A: Not directly. It models pairwise comparisons only. For multi-way rankings, you decompose them into pairs or use extensions like the Plackett-Luce model, which generalizes Bradley-Terry to ordered lists.
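A sketch of the Plackett-Luce idea (illustrative; `plackett_luce_prob` is our naming): the model builds a full ranking by picking the top remaining item repeatedly, each time with probability proportional to the exponential of its score.

```python
import math

def plackett_luce_prob(ranking, scores):
    """Probability of an ordered list under Plackett-Luce: repeatedly select
    the top remaining item with probability proportional to exp(score)."""
    prob = 1.0
    remaining = list(ranking)
    for item in ranking:
        total = sum(math.exp(scores[r]) for r in remaining)
        prob *= math.exp(scores[item]) / total
        remaining.remove(item)
    return prob

# With exactly two items, Plackett-Luce reduces to Bradley-Terry:
p = plackett_luce_prob([0, 1], [2.0, 0.0])  # == sigmoid(2.0 - 0.0)
```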
Sources
- Wikipedia: Bradley-Terry model - Original formulation, mathematical properties, and historical context
- Sun et al.: Rethinking Bradley-Terry Models in Preference-Based Reward Modeling - ICLR 2025 paper examining BT assumptions and classification-based alternatives
Expert Takes
The Bradley-Terry model rests on one elegant assumption: preference is a function of latent scalar difference passed through a sigmoid. This forces transitivity of predicted preferences — if the model favors A over B and B over C, it must favor A over C. Real human preferences violate this regularly, especially in subjective language tasks. The model works not because the assumption is true, but because the approximation produces trainable gradients that converge reliably.
When you wire a Bradley-Terry loss into your reward model training loop, the practical benefit is that the loss function is stable and well-understood. Swap in cross-entropy over preference pairs, set your learning rate, and the gradients behave predictably. If your reward model produces inconsistent rankings, check your annotation pipeline quality and label agreement rates first. The math is rarely where things break.
Every major AI lab uses some version of Bradley-Terry scoring to train their flagship models. The companies that collect the most preference data and clean it well end up with the strongest reward signal. Data annotation quality has quietly become a strategic differentiator — the model architecture is published, but the preference dataset is the moat nobody can copy.
The Bradley-Terry model assumes a single linear scale of quality. But “better” is not one-dimensional. A response can be more helpful but less safe, more creative but less accurate. Collapsing these tradeoffs into one number means someone chose which values to prioritize — and that choice is buried in annotation guidelines, invisible to the person who trusts the output.