NannyML
Also known as: nannyml, NannyML Cloud, post-deployment data science
NannyML is an open-source Python library for monitoring deployed machine learning models, estimating their accuracy before true labels arrive and flagging only the data drift that actually degrades performance.
What It Is
When a machine learning model goes into production — scoring loan applications, flagging fraud, or ranking product recommendations — the team rarely learns right away whether its predictions were correct. The true outcome, the “ground-truth label,” often arrives days or weeks later: a borrower defaults, a flagged transaction gets confirmed, a recommendation does or does not convert. During that waiting period a model can quietly degrade and nobody notices. NannyML exists to close that blind spot.
It does this with label-free performance estimation. Instead of waiting for the real answers, NannyML estimates how accurate the model is right now from signals available at prediction time. According to NannyML Docs, classification models are handled by CBPE (Confidence-Based Performance Estimation), which reads the model’s own confidence in each prediction, and regression models by DLE (Direct Loss Estimation), which trains a secondary model to predict the monitored model’s error. The result is an estimate of metrics like accuracy or error before a single true label is available. CBPE works a little like a teacher who gauges how a class performed on an exam by watching how confidently each student answered, well before the papers are graded.
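The idea behind confidence-based estimation can be sketched in a few lines for a binary classifier. This is a minimal illustration, not NannyML’s implementation: the function name `estimate_accuracy` and the sample scores are hypothetical, and it assumes the predicted probabilities are well calibrated and that predictions use a 0.5 threshold.

```python
from typing import Sequence


def estimate_accuracy(probs: Sequence[float], threshold: float = 0.5) -> float:
    """Estimate accuracy from calibrated P(y=1) scores, with no labels.

    If a score p is calibrated, a positive prediction is correct with
    probability p and a negative prediction with probability 1 - p.
    Averaging those per-row probabilities gives the expected accuracy.
    """
    if not probs:
        raise ValueError("need at least one prediction")
    expected_correct = [p if p >= threshold else 1.0 - p for p in probs]
    return sum(expected_correct) / len(expected_correct)


# Hypothetical scores: a batch of confident predictions and a hesitant one.
confident = [0.95, 0.05, 0.90, 0.10]
hesitant = [0.60, 0.40, 0.55, 0.45]

print(round(estimate_accuracy(confident), 3))  # about 0.925
print(round(estimate_accuracy(hesitant), 3))   # about 0.575
```

The hesitant batch scores lower even though no label was consulted, which is the whole trick. It is also why the estimate is only as good as the calibration behind the probabilities.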
The second piece is drift detection tied to impact. Data drift means the live input data has shifted away from what the model saw during training — customer behavior changes, or a new product launches. Most monitoring tools alert on every such shift, which buries teams in noise. NannyML instead links each drift signal back to its estimated effect on performance, so a team chases only the drift that actually moves accuracy. This matters most under covariate shift, where the inputs change but the underlying relationship between inputs and outcomes stays the same — exactly the situation in which a model keeps running without obvious errors yet loses accuracy. A commercial product, NannyML Cloud, builds a hosted dashboard and alerting layer on top of the open-source core.
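To make "drift signal" concrete, here is a sketch of one generic univariate drift statistic, the two-sample Kolmogorov-Smirnov distance, written in plain Python. The function name `ks_statistic` and the sample values are hypothetical, and this is not presented as NannyML’s own code. Note that it measures only how far a feature has moved, not whether the model got worse, which is exactly the gap the paragraph above describes.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(reference: Sequence[float], production: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not reference or not production:
        raise ValueError("both samples must be non-empty")
    ref, prod = sorted(reference), sorted(production)
    gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in ref + prod:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_prod = bisect_right(prod, x) / len(prod)
        gap = max(gap, abs(cdf_ref - cdf_prod))
    return gap


# Hypothetical feature values, e.g. customer age at training time vs. in production.
reference = [20, 25, 30, 35, 40]
partial_shift = [30, 35, 40, 45, 50]
full_shift = [45, 50, 55, 60, 65]

print(ks_statistic(reference, reference))      # no drift
print(ks_statistic(reference, partial_shift))  # moderate drift
print(ks_statistic(reference, full_shift))     # complete separation
```

A large value here is a reason to look at estimated performance, not a reason to retrain by itself.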
How It’s Used in Practice
The most common setting is a data or ML team that already has models in production and faces delayed labels. They point NannyML at recent batches of production data plus the model’s predictions, and it returns an estimated-performance timeline alongside drift signals. When estimated accuracy falls below a chosen threshold, the team investigates or schedules model retraining, without waiting weeks for ground truth to confirm the problem.
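The batch-and-threshold workflow can be sketched as follows. This is an illustrative stand-in rather than NannyML’s API: `chunk_estimates`, `chunks_below`, the 0.8 threshold, and the sample scores are all hypothetical, and the per-chunk estimate assumes calibrated binary probabilities with a 0.5 decision threshold.

```python
from typing import List, Sequence


def chunk_estimates(probs: Sequence[float], chunk_size: int) -> List[float]:
    """Estimated accuracy per consecutive chunk, assuming calibrated P(y=1)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    estimates = []
    for start in range(0, len(probs), chunk_size):
        chunk = probs[start:start + chunk_size]
        # max(p, 1 - p) is the chance a 0.5-threshold prediction is correct.
        estimates.append(sum(max(p, 1.0 - p) for p in chunk) / len(chunk))
    return estimates


def chunks_below(estimates: Sequence[float], threshold: float) -> List[int]:
    """Indices of chunks whose estimated accuracy falls below the threshold."""
    return [i for i, est in enumerate(estimates) if est < threshold]


# Hypothetical stream: two healthy batches, then one where confidence collapses.
stream = [0.9, 0.1, 0.95, 0.05,
          0.85, 0.2, 0.9, 0.1,
          0.6, 0.45, 0.55, 0.5]
timeline = chunk_estimates(stream, chunk_size=4)
alerts = chunks_below(timeline, threshold=0.8)

print([round(e, 3) for e in timeline])
print("chunks needing investigation:", alerts)
```

Fixing the threshold before deployment keeps the alert rule from being tuned after the fact to whatever the latest batch happens to show.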
In a monitoring stack like the one this article describes, NannyML usually sits next to a broader drift tool rather than replacing it. A library such as Evidently AI computes the distribution-shift metrics; NannyML answers the follow-up question that drift alone cannot — does this shift actually hurt the model? That division of labor is why the two are frequently deployed together.
Pro Tip: Don’t retrain on drift alone. A distribution can shift hard while accuracy stays flat, or barely move while performance quietly tanks. Wait for NannyML’s estimated-performance drop before paying the cost of retraining, and make sure your model’s probability outputs are calibrated first, because the estimate leans on those confidence scores being honest.
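One way to check calibration on a labeled reference set is the expected calibration error (ECE). It bins predictions by probability and compares the average predicted probability in each bin with the observed positive rate. The sketch below is a generic version in plain Python; the function name, the bin count, and the sample data are assumptions for illustration, not part of NannyML.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average gap between mean predicted probability and observed
    positive rate across equal-width probability bins (0 = well calibrated)."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        error += (len(members) / len(probs)) * abs(mean_prob - positive_rate)
    return error


# Hypothetical reference data: the model says 0.8 for every row.
scores = [0.8] * 5
honest_labels = [1, 1, 1, 1, 0]         # 80% positive, matching the scores
overconfident_labels = [1, 0, 0, 1, 0]  # only 40% positive

print(round(expected_calibration_error(scores, honest_labels), 3))
print(round(expected_calibration_error(scores, overconfident_labels), 3))
```

A score near zero suggests the probabilities can be trusted as inputs to a confidence-based estimate. A large score is a signal to recalibrate before relying on that estimate.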
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Ground-truth labels arrive with a long delay (credit, fraud, churn) | ✅ | |
| You need accuracy estimates between labeling cycles | ✅ | |
| Labels are available immediately and cheaply | | ❌ |
| You want one alert per real performance drop, not per distribution shift | ✅ | |
| Your model outputs uncalibrated, unreliable confidence scores | | ❌ |
| Data-quality or LLM-output testing is your primary goal | | ❌ |
Common Misconception
Myth: NannyML measures your model’s real accuracy in production without ever needing labels. Reality: It estimates accuracy from the model’s confidence and the shape of incoming data. That estimate is statistically grounded, but it is still an estimate, not ground truth — you eventually need real labels to validate it and to keep the model’s confidence scores calibrated.
One Sentence to Remember
NannyML’s value is not detecting that your data changed — plenty of tools do that — but telling you whether the change is hurting the model before the real labels arrive to prove it, so you spend effort only on the drift that costs you accuracy.
FAQ
Q: How does NannyML estimate performance without labels? A: According to NannyML Docs, CBPE handles classification and DLE handles regression. CBPE turns the model’s predicted probabilities into estimated metrics like accuracy, while DLE trains a secondary model to predict the monitored model’s error. Both produce their estimates before true labels arrive.
Q: Is NannyML free? A: The core NannyML library is open source and free to use in Python. A separate commercial product, NannyML Cloud, adds a hosted dashboard, alerting, and managed monitoring on top of that open-source foundation.
Q: How is NannyML different from Evidently or Arize? A: Most drift tools tell you the data changed. NannyML focuses on estimating whether that change degrades model performance before labels arrive, so teams often run it alongside Evidently or Arize rather than instead of them.
Sources
- NannyML’s GitHub repository: NannyML/nannyml, post-deployment data science in Python. The open-source library, its CBPE and DLE estimators, and drift detection.
- NannyML Docs: Detecting Data Drift, NannyML documentation. Official documentation for label-free performance estimation and drift methods.
Expert Takes
Not measurement. Estimation. NannyML never sees the ground-truth label at scoring time, so it infers performance from the model’s own confidence and the shape of the incoming data. The principle is that a well-calibrated classifier’s predicted probabilities already encode its expected error rate. Read those probabilities correctly under a shifted input distribution, and you can approximate accuracy before reality confirms it.
The failure mode is silent: a model keeps serving predictions while accuracy erodes, and the dashboard stays green because no labels have landed to prove otherwise. The fix is to make estimated performance a first-class signal in the pipeline, not an afterthought. Wire the estimate to your retraining trigger, define the threshold up front, and the gap between deployment and ground truth stops being a blind spot.
Monitoring is consolidating fast, and label-free performance estimation is the feature buyers now expect by default. The market has moved past “did the data drift” to “is it costing us money yet.” Tools that only flag distribution shifts are becoming commodities. The ones that connect drift to business impact own the conversation. That shift in expectation is where NannyML planted its flag.
An estimate is a confident guess dressed as a measurement. If a model’s predicted probabilities are biased — and they often are for underrepresented groups — then an estimate built on those probabilities inherits the same blind spot. So who is accountable when the dashboard says the model is fine and the people it quietly fails never show up in the numbers? The estimate is only as honest as the confidence beneath it.