Aequitas

Also known as: Aequitas toolkit, bias and fairness audit toolkit, Aequitas Flow

Aequitas is an open-source Python toolkit from the University of Chicago’s Data Science for Social Good group that audits machine learning models for bias and fairness across population subgroups, and, since its Aequitas Flow rewrite, also applies bias mitigation methods.

Aequitas is an open-source toolkit that audits machine learning models for bias, measuring whether predictions treat population subgroups — defined by attributes like race, gender, or age — fairly across the board.

What It Is

Dataset bias — the selection, representation, and measurement problems covered in the parent article — starts in your data, but it does not stay there. Once a model trains on skewed data, the skew shows up in who the model gets right and who it gets wrong. Aequitas exists to make that visible. It takes a model’s predictions, splits them by a sensitive attribute (such as race, gender, or age group), and measures whether the model performs equally well for each group. For a product owner deciding whether a classification model is safe to ship, it turns “we think the data might be biased” into a concrete, group-by-group report.

The toolkit works from a confusion matrix — the tally of a model’s correct and incorrect predictions. From that tally it computes group fairness metrics for each subgroup: the true positive rate (how often the model correctly flags the cases it should), the false positive rate (how often it wrongly flags cases it should not), precision, and several others. It then compares each group against a reference group you choose and reports where the gaps are large enough to matter. The output reads less like a single accuracy score and more like an audit ledger: every group, every error type, side by side.
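The per-group tally described above can be sketched in plain Python. This is a minimal illustration of the arithmetic, not Aequitas's own code: the rows are made up, and the field names score and label_value are assumptions chosen to resemble the prediction and outcome columns an audit input would carry.

```python
from collections import defaultdict

# Hypothetical audit rows: (group, score, label_value).
# score is the model's binary prediction, label_value is the true outcome.
rows = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 1), ("A", 0, 0), ("A", 1, 1),
    ("B", 1, 1), ("B", 0, 1), ("B", 0, 1), ("B", 0, 0), ("B", 1, 0),
]


def group_confusion(rows):
    """Tally TP, FP, FN and TN separately for each group."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, score, label_value in rows:
        if score == 1:
            cell = "tp" if label_value == 1 else "fp"
        else:
            cell = "fn" if label_value == 1 else "tn"
        counts[group][cell] += 1
    return dict(counts)


def safe_div(numerator, denominator):
    """Return None instead of failing when a group has no relevant cases."""
    return numerator / denominator if denominator else None


def group_metrics(c):
    """Derive the rates named in the text from one group's confusion counts."""
    return {
        "tpr": safe_div(c["tp"], c["tp"] + c["fn"]),        # true positive rate
        "fpr": safe_div(c["fp"], c["fp"] + c["tn"]),        # false positive rate
        "precision": safe_div(c["tp"], c["tp"] + c["fp"]),  # share of flags that were right
    }


confusion = group_confusion(rows)
metrics = {group: group_metrics(c) for group, c in confusion.items()}

for group in sorted(metrics):
    print(group, confusion[group], metrics[group])
```

In this toy data both groups share a false positive rate of 0.5, yet group A's true positive rate is 2/3 against 1/3 for group B, which is the kind of gap a single accuracy score would hide.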

Aequitas began as a pure audit tool from the University of Chicago’s Data Science for Social Good group. According to the Aequitas GitHub repository, the project later added an end-to-end Fair ML layer called Aequitas Flow, which extends it beyond measuring bias into mitigating it with pre-, in-, and post-processing methods. According to the aequitas page on PyPI, the toolkit is MIT-licensed and the latest release listed there is version 1.1.0. You can run it three ways (as a Python library, a command-line tool, or a web application), so a data scientist and a policy reviewer can use the same engine through different doors.

How It’s Used in Practice

The most common moment to reach for Aequitas is right before a model goes live, or during a review after it has. A team has a classifier that informs a consequential decision — approving a loan, ranking a job applicant, flagging a transaction. They take the model’s predictions on held-out data, add a column marking which protected group each record belongs to, and feed both to Aequitas. The result is a report showing, for each group, how often the model was right, how often it raised a false alarm, and how those rates diverge from the reference group. The disparities the dataset bias article warned about in the abstract now have numbers attached to specific groups.
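The comparison against a reference group can be sketched as a ratio of rates. The example below is a hedged illustration, not Aequitas's implementation: the group names and rates are invented, the helper names disparities and flag_gaps are hypothetical, and the 0.8 cutoff is an assumption borrowed from the common four-fifths rule rather than a setting taken from the source.

```python
# Hypothetical per-group rates, as a per-group audit might report them.
group_rates = {
    "group_a": {"tpr": 0.80, "fpr": 0.10},
    "group_b": {"tpr": 0.60, "fpr": 0.15},
    "group_c": {"tpr": 0.78, "fpr": 0.11},
}


def disparities(group_rates, reference):
    """Divide each group's rate by the reference group's rate for the same metric.

    Assumes every reference rate is non-zero.
    """
    ref = group_rates[reference]
    return {
        group: {metric: value / ref[metric] for metric, value in rates.items()}
        for group, rates in group_rates.items()
    }


def flag_gaps(disparity, tau=0.8):
    """List (group, metric, ratio) entries whose ratio falls outside [tau, 1/tau]."""
    flagged = []
    for group, ratios in disparity.items():
        for metric, ratio in ratios.items():
            if not (tau <= ratio <= 1 / tau):
                flagged.append((group, metric, round(ratio, 2)))
    return flagged


disparity = disparities(group_rates, reference="group_a")
flagged = flag_gaps(disparity)
print(flagged)
```

With group_a as the reference, only group_b is flagged: its true positive rate is 0.75 times the reference and its false positive rate is 1.5 times the reference, while group_c stays inside the band on both.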

This is where it connects to the work upstream. Understanding selection, representation, and measurement bias tells you where unfairness might enter; Aequitas tells you whether it actually did, and for whom. The cycle runs: diagnose the data, ship a model, audit the outcomes, then feed what you find back into the next round of collection.

Pro Tip: Choose your reference group deliberately before you run the audit, not after you see the results. Every fairness metric Aequitas reports is relative to that reference, so picking it to flatter the numbers quietly defeats the point. Audit on the data slices your business actually acts on, not just the overall population.

When to Use / When Not

Scenario | Use | Avoid
Checking whether a classifier’s error rates differ across demographic groups | ✓ |
You only need a single overall accuracy or AUC score | | ✓
Producing a documented fairness report for stakeholders or regulators | ✓ |
Auditing a free-text generative model with no clear classification labels | | ✓
Comparing several candidate models on group fairness before deployment | ✓ |
Deciding the ethical definition of “fair” for your domain | | ✓

Common Misconception

Myth: Running Aequitas makes a model fair. Reality: Aequitas measures and surfaces disparities; it does not, on its own, make a model fair. Even with Aequitas Flow’s mitigation methods, the tool cannot decide what fairness means for your context — which metric matters, which group is the reference, what gap is acceptable. Those are human and policy choices. Aequitas gives you the evidence to make them, not the verdict.

One Sentence to Remember

Aequitas is the measuring tape, not the tailor: it tells you precisely where a model treats groups differently, but closing that gap — and deciding which gaps count — is still your call. Pair it with a clear understanding of where your dataset bias comes from, and the audit becomes a feedback loop rather than a one-off checkbox.

FAQ

Q: Is Aequitas free to use? A: Yes. According to the Aequitas GitHub, it is open-source under the MIT license, maintained by the University of Chicago’s Data Science for Social Good group, and free for both commercial and research use.

Q: Does Aequitas fix bias or only detect it? A: It started as a detection-only audit tool. According to the Aequitas GitHub, the Aequitas Flow rewrite added mitigation methods, so it can now both measure disparities and apply techniques to reduce them.

Q: What kind of models can Aequitas audit? A: Any model whose predictions can be framed as a classification with group labels — approve or deny, flag or pass. It works on the predictions and a sensitive attribute, not the model internals.

Expert Takes

Fairness is not one number. It is a family of group metrics that can disagree. A model can equalize false-positive rates across groups and still differ on precision. Aequitas makes this concrete by reporting the full confusion matrix per subgroup instead of collapsing it into a single accuracy figure. The takeaway: there is no one fairness score to optimize, only trade-offs you must name.
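A small worked example shows how two group metrics can disagree. The counts below are invented for illustration: both groups get the same false positive rate and the same true positive rate, but because far fewer members of group_y are actual positives, precision still diverges.

```python
# Hypothetical confusion counts for two groups with different base rates.
counts = {
    "group_x": {"tp": 80, "fp": 10, "fn": 20, "tn": 90},  # 100 actual positives
    "group_y": {"tp": 16, "fp": 10, "fn": 4, "tn": 90},   # 20 actual positives
}


def fpr(c):
    """False positive rate: wrongly flagged cases over all actual negatives."""
    return c["fp"] / (c["fp"] + c["tn"])


def tpr(c):
    """True positive rate: correctly flagged cases over all actual positives."""
    return c["tp"] / (c["tp"] + c["fn"])


def precision(c):
    """Precision: correctly flagged cases over all flagged cases."""
    return c["tp"] / (c["tp"] + c["fp"])


for group, c in counts.items():
    print(f"{group}: fpr={fpr(c):.3f} tpr={tpr(c):.3f} precision={precision(c):.3f}")
```

Both groups have a false positive rate of 0.1 and a true positive rate of 0.8, yet precision is 80/90 for group_x and 16/26 for group_y. Equalizing one metric did not equalize the other, so the choice of metric has to be made explicitly.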

Treat the fairness audit as part of your spec, not a final inspection. The decisions that shape the report — which attribute is sensitive, which group is the reference, what gap is acceptable — belong in writing before training, not improvised after. Wire Aequitas into the same pipeline that runs your tests, and a regression in group fairness fails the build like any other broken contract.
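One way to make a fairness regression fail the build is a small gate that runs alongside the test suite. The sketch below is an assumption about how such a check could look, not an Aequitas feature: check_fairness_contract is a hypothetical helper, the disparity ratios are hard-coded stand-ins for whatever the audit step produces, and the metric list and tolerance represent the decisions the team wrote down before training.

```python
def check_fairness_contract(disparity, metrics, tau):
    """Raise AssertionError if any agreed metric's ratio leaves [tau, 1/tau].

    disparity maps group -> {metric: ratio against the reference group}.
    """
    violations = [
        (group, metric, ratio)
        for group, ratios in disparity.items()
        for metric, ratio in ratios.items()
        if metric in metrics and not (tau <= ratio <= 1 / tau)
    ]
    if violations:
        details = ", ".join(f"{g}/{m}={r:.2f}" for g, m, r in violations)
        raise AssertionError(f"fairness contract violated: {details}")


# Decisions recorded in the spec (hypothetical values).
AGREED_METRICS = {"tpr", "fpr"}
AGREED_TAU = 0.8

# Stand-in for the ratios produced by the audit step of the pipeline.
current_disparity = {
    "group_a": {"tpr": 1.00, "fpr": 1.00},
    "group_b": {"tpr": 0.90, "fpr": 1.10},
}

# Passes silently; a ratio outside the band would raise and fail the run.
check_fairness_contract(current_disparity, AGREED_METRICS, AGREED_TAU)
```

Keeping the metric list and tolerance as named constants makes the written contract reviewable in the same change history as the model code.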

Fairness auditing is moving from nice-to-have to table stakes. Procurement teams and regulators increasingly ask vendors to show, not assert, that a model treats groups evenly. An open-source toolkit that produces a defensible report turns that demand from a liability into a selling point. The teams that build auditing into their release process now will not be scrambling when a buyer’s questionnaire arrives.

A clean audit report is reassuring, which is exactly its danger. Aequitas can only measure the groups you thought to label and the metric you chose to privilege. The harm done to a subgroup nobody encoded stays invisible, and a passing score can launder a model into looking just. Who decides which attributes count, and who answers for the ones left out?