MONA explainer 10 min read March 28, 2026

Demographic Parity vs. Equalized Odds vs. Calibration: Core Fairness Metrics Compared

Three intersecting geometric boundaries representing competing fairness constraints across a population distribution

Table of Contents

ELI5

Fairness metrics measure whether an AI system treats different groups equally — but “equally” has at least three mathematical definitions, and satisfying one can violate another.

A hiring algorithm rejects more candidates from one demographic group than another, and the team building it disagrees about whether the system is biased — not because anyone is lying, but because each person is measuring fairness with a different ruler. One checks whether acceptance rates match across groups; another checks whether the model’s errors are distributed equally; a third checks whether a risk score of 0.8 means the same thing regardless of who received it. All three analyses are mathematically valid. All three produce contradictory verdicts from identical data.

The Three Families That Refuse to Converge

Bias And Fairness Metrics did not emerge from a single coherent tradition. They were born in separate communities — anti-discrimination law, statistical decision theory, and probability calibration — each encoding a different intuition about what “fair” should mean. The result is three families of metrics that share a vocabulary but disagree on the math (Fairlearn Docs).

What are the main types of fairness metrics used in AI systems?

The three families are Independence, Separation, and Sufficiency. The fastest way to understand them is through what each one ignores.

Independence ( Demographic Parity): the prediction should be statistically independent of the sensitive attribute. Formally, E[h(X)|A=a] = E[h(X)] for all values of a. A loan approval system satisfies demographic parity if it approves the same proportion of applicants in every group — period. It does not ask whether those applicants are equally creditworthy. That is both its appeal and the reason critics call it naive.

Separation ( Equalized Odds): the prediction should be independent of the sensitive attribute conditional on the true outcome. Formally, E[h(X)|A=a,Y=y] = E[h(X)|Y=y] for all a and y (Hardt et al.). The model’s true positive rate and false positive rate must be equal across groups. Separation permits unequal acceptance rates — as long as the errors are distributed fairly. It anchors fairness in the Confusion Matrix, not the outcome distribution.

A relaxation called equal opportunity requires only that the true positive rate be equal across groups, dropping the constraint on false positives. When the ML fairness literature mentions “equality of opportunity,” this is usually what it means — a one-sided version of equalized odds.

Sufficiency (calibration): the true outcome should be independent of the sensitive attribute conditional on the prediction score. Formally, P[Y=1|S=s,A=a] = P[Y=1|S=s] for all s and a (Kleinberg et al.). A risk score of 0.7 should mean a 70% probability of the event, whether the individual belongs to group A or group B. Calibration does not care about acceptance rates or error rates. It cares about whether the score means what it claims to mean.

Three definitions. Three axioms. Each captures something real about fairness — and each is blind to what the others measure.

Inside the Arithmetic of Disagreement

The tension between these metrics is not philosophical. It is arithmetic — visible the moment you apply two definitions to the same classifier on the same data. The gap between them reduces to a single variable, and once you see it, the impossibility results that follow become almost obvious.

What is the difference between demographic parity and equalized odds?

Demographic parity asks: does the system produce the same positive rate across groups? Equalized odds asks: given what the right answer actually is, does the system make the same mistakes across groups?

The distinction collapses to a single conditioning variable: the true label Y.

Suppose group A has a 30% base rate for the positive outcome and group B has a 60% base rate. A perfectly accurate classifier — one that makes zero errors — will automatically violate demographic parity. It approves 30% of group A and 60% of group B. Unequal rates, but correctly reflecting the underlying populations.

Equalized odds is satisfied by this perfect classifier because both groups have identical TPR (100%) and identical FPR (0%). The errors are equal; they are both zero.

Demographic parity, forced to equalize the rates, would require the system to either approve more people from the low-base-rate group or reject more from the high-base-rate group. Both adjustments introduce errors that were not present before. The collision is not ideological — it is about base rates. When the underlying prevalence of the positive outcome differs between groups, equalizing outcomes and equalizing errors become mathematically incompatible goals.

Calibration adds a third axis of tension. A calibrated model assigns scores that reflect true probabilities — but if base rates differ, a calibrated model will necessarily produce different acceptance rates per group. Calibration and demographic parity pull in opposite directions whenever the population data is not identically distributed across groups.

What statistics and probability concepts do you need to understand bias and fairness metrics?

The math underneath fairness metrics is conditional probability — and the errors that matter live in the confusion matrix.

Four numbers define the confusion matrix for a binary classifier: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Every fairness metric is a constraint on ratios derived from these four counts, computed per group.

True positive rate (TPR, recall): TP / (TP + FN) — of all actual positives, how many did the model catch?
False positive rate (FPR): FP / (FP + TN) — of all actual negatives, how many did the model wrongly flag?
Positive predictive value (PPV, precision): TP / (TP + FP) — of all predicted positives, how many were correct?

Equalized odds constrains TPR and FPR. Calibration constrains PPV (among other quantities). Demographic parity ignores all of these and looks only at the overall positive prediction rate.

The concept of base rate — the prevalence of the positive class in each group — is the variable that determines whether these constraints can coexist. When base rates are equal across groups, most fairness metrics can be satisfied simultaneously. When they differ, the Impossibility Theorem activates.

Bayes’ theorem connects these rates: the relationship between TPR, FPR, PPV, and base rate is not independent. Shift one constraint, and the others move. This is not a design flaw in the metrics. It is a consequence of probability itself.

Diagram showing three fairness metric families — Independence, Separation, and Sufficiency — with their formulas and the confusion matrix quantities each constrains — Three families of fairness metrics constrain different parts of the confusion matrix and diverge when base rates differ across groups.

The Theorem That Forbids Compromise

The impossibility results arrived in 2016 and 2017, from two independent directions, and they settled a question many practitioners had hoped to solve with better algorithms.

Kleinberg, Mullainathan, and Raghavan proved that calibration and certain balance conditions — which relate to equalized odds — cannot be simultaneously satisfied except in trivial cases where the classifier is either perfect or the base rates are identical across groups (Kleinberg et al.). Chouldechova proved a complementary result: when base rates differ, no classifier can simultaneously equalize FPR, FNR, and PPV (Chouldechova).

Not a limitation of current methods. A mathematical boundary.

The implication is uncomfortable but precise: choosing a fairness metric is a normative decision, not a technical one. The math tells you what each metric measures and what it ignores. It cannot tell you which trade-off is acceptable for your application.

In employment contexts, the Four Fifths Rule provides a regulatory anchor — the selection rate for any group must be at least 80% of the highest group’s rate (EEOC). This is a practical instantiation of the demographic parity family, codified in U.S. federal guidelines long before ML fairness became a research field.

The tooling has caught up to the theory. Fairlearn (v0.13.0, as of October 2025) provides demographic_parity_difference(), equalized_odds_difference(), and their ratio variants — all scikit-learn compatible (Fairlearn Docs). AI Fairness 360 offers 15 mitigation algorithms covering pre-processing, in-processing, and post-processing interventions; its sklearn-compatible API is the recommended path, though the last release (v0.6.1) dates to April 2024 — verify current development activity before adopting for new projects (AIF360 GitHub).

Neither toolkit resolves the impossibility theorems. They make it easier to measure and mitigate along whichever axis you choose — but the choice itself remains yours.

Rule of thumb: Pick the metric that aligns with the harm you most need to prevent. If wrongful rejection destroys a livelihood, equalize false negative rates. If wrongful acceptance creates risk, equalize false positive rates. If the score feeds directly into downstream decisions, calibrate.

When it breaks: When base rates differ substantially between groups — common in criminal justice, healthcare, and lending — optimizing one fairness metric actively degrades another. Teams that skip the normative conversation and default to whatever metric the library computes first end up encoding a value judgment they never explicitly made. The impossibility theorems guarantee that no algorithm resolves this on their behalf. Counterfactual Fairness offers an alternative framing — asking whether the prediction would change if the individual belonged to a different group — but it requires causal models that are rarely available in practice.

The Data Says

Fairness metrics are not competing answers to the same question. They are answers to three different questions — about outcomes, about errors, and about meaning — that happen to share a name. The impossibility theorems prove they cannot be unified when base rates diverge. Choosing among them is the first engineering decision in any fairness pipeline, and it is not a decision any algorithm can make. A model that produces a Hallucination about its own fairness — reporting one metric as if it settles the matter — is arguably more dangerous than a model with a known, documented bias.

Aha Moments

MAX

Three definitions of fairness, one deployment. The math says you cannot have all of them — and most teams never decide which one they actually want. That is a specification failure, not an optimization failure. Every system that measures “fairness” without declaring which metric it uses is running an implicit, untested spec. The fix: your fairness requirement document names the metric, states why that metric was chosen over the alternatives, and defines what a passing threshold looks like. If your team cannot fill in those fields before training begins, the model is underspecified. Underspecified systems fail in ways nobody anticipated, because nobody wrote down what success meant. Write the spec. The metric selection is the spec.

DAN

The impossibility theorems are not academic footnotes — they reshape how fairness tooling gets adopted in production. Teams that default to demographic parity because it is the easiest to compute are making a strategic bet they did not intend. The regulatory ground is already shifting: the EU AI Act demands documented bias mitigation for high-risk systems, and employment law in the U.S. has enforced the four-fifths rule for decades. Companies that treat metric selection as a product decision — documented, justified, auditable — will survive regulatory scrutiny. Companies that treat it as a checkbox will not. That is not a compliance opinion. That is a liability forecast.

ALAN

Both of you frame this as solvable — write a better spec, pick a smarter strategy. But the impossibility theorems say something more uncomfortable: every choice of fairness metric protects one group’s interests at the mathematical expense of another’s. This is not a design flaw to be engineered away. It is the structure of the problem. Who decides which trade-off is acceptable? The engineers who built the system? The executives who commissioned it? The communities whose lives the scores sort into categories? When the math guarantees that fairness cannot be universal, every system labeled “fair” embeds a political judgment about whose harm is tolerable — usually made by people who will never bear that harm themselves. How do we build legitimate trust in systems whose fairness guarantees are, by mathematical proof, always partial?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors