Inter Annotator Agreement
Also known as: IAA, inter-rater reliability, annotator agreement
- Inter-annotator agreement is a measure of how consistently independent human labelers assign the same labels to the same data items, using metrics like percent agreement or chance-corrected scores such as Cohen’s kappa to quantify labeling reliability and the clarity of annotation guidelines.
Inter-annotator agreement (IAA) measures how consistently two or more human labelers assign the same label to the same data, revealing whether annotation guidelines are clear enough to produce reliable training data.
What It Is
Supervised machine learning models learn by example. Feed them thousands of labeled items — emails marked spam or not, images tagged with the object they contain, support tickets sorted by topic — and they learn to reproduce those labels on new data. But the whole process rests on a quiet assumption: that the labels themselves are correct and consistent. Inter-annotator agreement is the check on that assumption. It asks a simple question — if two people label the same item, do they pick the same answer? When they routinely don’t, the “ground truth” your model trains on is anything but.
A useful analogy is two radiologists reading the same X-ray. If they consistently reach the same diagnosis, you can trust the reading reflects something real in the image. If they disagree half the time, neither verdict is dependable — and the problem might be the image, their training, or fuzzy criteria for what counts as “abnormal.” Annotation works the same way. Agreement measures the reproducibility of the labeling process, not just the mood of one labeler on one afternoon.
The simplest measure is percent agreement: the share of items where labelers picked the same label. It’s intuitive but misleading, because some agreement happens by pure chance, especially when one label dominates the data. To correct for that, teams use chance-corrected metrics. Cohen’s kappa compares two annotators and discounts the agreement you’d expect by chance given how often each annotator uses each label: kappa = (observed agreement - chance agreement) / (1 - chance agreement). Fleiss’ kappa extends the idea to more than two annotators, and Krippendorff’s alpha handles missing data and different label types (nominal, ordinal, interval). The exact metric matters less than the habit it enforces: measure agreement before trusting the labels.
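The two-annotator metrics described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names and the tiny `ann_a` / `ann_b` label lists are made up for the example.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two annotators chose the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    n = len(a)
    p_o = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability that both pick the same label if each
    # labeled at random using their own observed label frequencies.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both used one identical label; kappa is undefined
    return (p_o - p_e) / (1 - p_e)


# Hypothetical overlap set: ten emails labeled by two people.
ann_a = ["ham"] * 8 + ["spam"] * 2
ann_b = ["ham"] * 7 + ["spam"] * 3

print(percent_agreement(ann_a, ann_b))       # 0.9
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.737
```

The annotators disagree on a single email, so raw agreement is 0.9, but because "ham" dominates both lists, a good part of that agreement is expected by chance and kappa comes out lower.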
How It’s Used in Practice
The most common scenario is building or auditing a labeled dataset. A team writes annotation guidelines, then has several people independently label the same sample of items, usually called an overlap set (sometimes loosely a “gold” set, though that term more strictly means items with adjudicated reference labels). They compute agreement on that overlap. Strong agreement is a green light to let annotators work independently on the rest of the data. Weak agreement sends them back to rewrite the guidelines before scaling up, because every ambiguous instruction gets multiplied across thousands of labels.
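The overlap-set check above might look like the following when more than two people label each item. It is a sketch that assumes every item has the same number of labels and uses Fleiss' kappa. The `MIN_KAPPA` cutoff, the function name, and the sample data are hypothetical; a real project would set its own threshold.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for items that each received the same number of labels.

    ratings: list of per-item label lists, e.g. [["pos", "pos", "neg"], ...]
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of labels")

    totals = Counter()
    per_item = []
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        # Fraction of annotator pairs on this item that agree.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        per_item.append(agreeing_pairs / (n_raters * (n_raters - 1)))

    p_bar = sum(per_item) / n_items
    # Chance agreement from the pooled label proportions.
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:
        return float("nan")  # only one label was ever used; kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


MIN_KAPPA = 0.6  # hypothetical project threshold, not a universal standard

# Hypothetical overlap set: four items, three annotators each.
overlap = [
    ["pos", "pos", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "neg"],
    ["neg", "neg", "neg"],
]

kappa = fleiss_kappa(overlap)
ready_to_scale = kappa >= MIN_KAPPA
print(round(kappa, 3), "scale up" if ready_to_scale else "revise guidelines")
```

A four-item set is far too small for a real decision. It only keeps the arithmetic easy to follow.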
The same check shows up wherever modern AI systems depend on human judgment. Teams labeling examples to fine-tune a model, reviewers rating which of two chatbot responses is better, and moderators tagging content as harmful all face the question of whether their labels are consistent enough to trust. When you hear that an evaluation set was “human-verified,” agreement is the number that tells you how much that verification is actually worth.
Pro Tip: Don’t chase a perfect agreement score. Some disagreement is signal, not noise — it often points to genuinely ambiguous items that deserve a clearer guideline or a third reviewer. Pull the items annotators split on and read them; that pile is usually the fastest route to a better dataset.
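Pulling the contested items, as the tip suggests, can be as simple as the sketch below. It assumes labels are stored as a dictionary from item ID to the list of labels that item received. The function name and sample data are hypothetical.

```python
from collections import Counter


def split_items(labels_by_item):
    """Return (item_id, label_counts) for items without a unanimous label,
    most contested first (smallest majority share)."""
    contested = []
    for item_id, labels in labels_by_item.items():
        counts = Counter(labels)
        if len(counts) > 1:
            majority_share = counts.most_common(1)[0][1] / len(labels)
            contested.append((majority_share, item_id, dict(counts)))
    # Sort by majority share, then by ID so the order is deterministic.
    contested.sort(key=lambda row: (row[0], row[1]))
    return [(item_id, counts) for _, item_id, counts in contested]


# Hypothetical moderation labels from three annotators.
labels = {
    "t1": ["toxic", "toxic", "toxic"],
    "t2": ["toxic", "ok", "ok"],
    "t3": ["toxic", "ok", "unsure"],
}

for item_id, counts in split_items(labels):
    print(item_id, counts)
```

Sorting by the size of the majority puts three-way splits ahead of two-to-one splits, so reviewers read the most ambiguous items first.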
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Building a dataset with subjective labels (sentiment, toxicity, relevance) | ✅ | |
| One annotator labels everything and no one double-checks a sample | | ❌ |
| Validating that annotation guidelines are clear before scaling up | ✅ | |
| Labels are purely mechanical with one obvious answer (copying a date field) | | ❌ |
| Comparing the labeling quality of two annotation vendors or teams | ✅ | |
| You have no overlap set — every item was labeled by only one person | | ❌ |
Common Misconception
Myth: High inter-annotator agreement means the labels are correct. Reality: Agreement only means labelers are consistent with each other, not that they’re right. A confusing or biased guideline can make everyone agree on the wrong answer. Agreement measures reliability; correctness needs a separate check against an authoritative reference or expert review.
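A small sketch makes the gap between reliability and correctness concrete. The labels below are invented: two annotators follow the same flawed guideline and match each other perfectly, while both miss the hypothetical expert `reference` on the same items.

```python
def match_rate(a, b):
    """Share of positions where two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical data: both annotators apply the same (flawed) guideline.
ann_a = ["pos", "neg", "pos", "neg", "pos", "neg"]
ann_b = ["pos", "neg", "pos", "neg", "pos", "neg"]
reference = ["pos", "neg", "neg", "neg", "neg", "neg"]  # expert-reviewed labels

inter_annotator = match_rate(ann_a, ann_b)      # consistency with each other
accuracy_a = match_rate(ann_a, reference)       # correctness against the reference

print(inter_annotator, round(accuracy_a, 2))    # 1.0 0.67
```

Agreement between the annotators is perfect, yet a third of the labels are wrong, which is why the reference check has to be run separately.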
One Sentence to Remember
Inter-annotator agreement tells you whether your labels are reproducible — and reproducible labels are the floor, not the ceiling, for trustworthy training data, so measure agreement on a small overlap set before you commit to labeling at scale.
FAQ
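Q: What is a good inter-annotator agreement score? A: It depends on the task and metric. On the widely cited Landis and Koch scale for kappa, 0.61 to 0.80 is labeled “substantial” and 0.81 to 1.00 “almost perfect,” and scores in those bands are usually treated as reliable. These cutoffs are conventions rather than guarantees, and subjective tasks like sentiment naturally score lower than objective ones.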
Q: What’s the difference between inter-annotator and intra-annotator agreement? A: Inter-annotator agreement compares different people labeling the same items. Intra-annotator agreement checks whether one person labels the same item the same way when shown it again later. Both gauge consistency.
Q: Why not just use percent agreement? A: Percent agreement ignores chance. If one label appears most of the time, annotators agree often just by guessing it. Chance-corrected metrics like Cohen’s kappa strip out that luck for a truer picture.
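A worked example with invented numbers shows the size of the effect. Suppose two annotators label 100 emails: both say "not spam" on 85, both say "spam" on 5, and they split on the remaining 10 (5 each way), so each annotator uses "spam" 10% of the time. Writing p_o for observed agreement and p_e for chance agreement:

```latex
\begin{align*}
p_o    &= \frac{85 + 5}{100} = 0.90 \\
p_e    &= (0.90)(0.90) + (0.10)(0.10) = 0.82 \\
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{0.90 - 0.82}{1 - 0.82} = \frac{0.08}{0.18} \approx 0.44
\end{align*}
```

A 90% raw agreement drops to a kappa of about 0.44 once the 82% agreement expected from the skewed label distribution is taken out.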
Expert Takes
Agreement is not the same as correctness. Two annotators can agree and both be wrong if the guideline itself is flawed. What inter-annotator agreement actually measures is the reliability of a labeling process — its reproducibility. Chance-corrected metrics matter because raw percent agreement inflates when one label dominates. Treat agreement as evidence that your categories are well-defined, not proof that they are true.
Think of your annotation guideline as a spec. Low agreement is a failing test — it tells you the spec is ambiguous before that ambiguity gets baked into a model. The fix is the same as any spec problem: find the items annotators split on, write explicit rules for those edge cases, then re-measure. Agreement scores turn a vague “labels feel inconsistent” into a signal you can act on.
Data quality is becoming the real competitive moat, not model size. Anyone can fine-tune the same open weights, but the team with cleaner, more consistently labeled data ships a better product. Inter-annotator agreement is how you prove that quality instead of assuming it. Buyers evaluating annotation vendors now ask for agreement numbers up front. If you can’t report yours, you’re already negotiating from a weak position.
Agreement metrics can quietly launder bias into ground truth. When annotators share the same background, they agree easily — and that consensus can encode the same blind spots into every model trained on the data. High agreement looks like quality, but it may just mean nobody in the room disagreed. Who decided which label was correct, and whose perspective was missing when they did? Reliable is not the same as fair.