MLflow

Also known as: MLflow Tracking, MLflow Registry, Databricks MLflow

MLflow is an open-source platform for managing the machine learning lifecycle. It tracks experiment parameters and metrics, packages models in a standard format, and maintains a versioned model registry with promotion workflows for moving models from development through staging to production.

MLflow is an open-source platform that tracks machine learning experiments, packages trained models, and manages model versions through a centralized registry — connecting experiment runs to the models that reach production.

What It Is

When a data science team trains dozens of models to solve the same problem, they quickly run into a tracking problem: which configuration produced the best result, and can anyone reproduce it six months later? MLflow was built to answer that question. It records the full context of each training run — parameters (learning rate, batch size, regularization settings), metrics (accuracy, F1, loss per epoch), code version, and the resulting model file — so teams can compare runs systematically and reproduce any of them later.

Think of MLflow as a combination of a lab notebook and a version control system for models. The lab notebook captures what you tried and what happened; the version control system tracks which version is currently in use and who signed off on it. Together they answer the two questions production teams ask most often: “What produced this model?” and “Should this version be trusted with live traffic?”

The platform covers four concerns:

  • Tracking: Records each experiment run — parameters, metrics per epoch, tags, and any output files. Accessible through a browser UI, REST API, or Python client.
  • Projects: A packaging format for reproducible training code — specifies the Python environment and the entry commands needed to run a training job on any machine.
  • Models: A standard model format that allows multiple serving tools (REST servers, batch jobs, cloud functions) to load the same saved model without conversion.
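  • Registry: A versioned catalog of trained models with an explicit promotion workflow. Each version moves through named stages (Staging, Production, Archived), and every transition is logged with a timestamp and the user who made it. MLflow 2.9 and later deprecate these fixed stages in favor of model version aliases and tags; the staged workflow described here reflects the classic Registry behavior.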

For the model registry concept — managing which version of a model is deployed, who approved it, and what evaluation metrics justified the decision — the Registry component is MLflow’s primary answer. The Tracking component feeds it: every model registered in the Registry links back to the experiment run that produced it, giving reviewers a direct path from “which version is in Production?” to “what training run created it, and what were its evaluation numbers?”

How It’s Used in Practice

Most teams start with MLflow Tracking. During model training, a few lines of logging code send metrics to the MLflow server after each evaluation step. The server stores the full run history, so a week later you can open the UI, filter runs by metric, and identify which hyperparameter combination produced the best validation accuracy — without relying on notes or memory.
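The logging-and-comparison loop described above can be sketched in plain Python. This is a toy in-memory model of what a tracking server records, not MLflow's API: the Run class, the best_run helper, the run IDs, and the accuracy numbers are all made up for illustration. In MLflow itself, the corresponding logging calls are mlflow.log_param and mlflow.log_metric.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training run: its parameters and the metric history it logged."""
    run_id: str
    params: dict
    # metric name -> list of (step, value) pairs, in logging order
    metrics: dict = field(default_factory=dict)

    def log_metric(self, name: str, value: float, step: int) -> None:
        """Append one (step, value) observation to the metric's history."""
        self.metrics.setdefault(name, []).append((step, value))

    def best(self, name: str) -> float:
        """Highest value this run ever logged for the given metric."""
        return max(value for _, value in self.metrics[name])


def best_run(runs: list, metric: str) -> Run:
    """Return the run whose best logged value of `metric` is highest."""
    return max(runs, key=lambda run: run.best(metric))


# Two hypothetical runs that differ only in learning rate.
run_a = Run("run-a", {"learning_rate": 0.01, "batch_size": 32})
run_b = Run("run-b", {"learning_rate": 0.001, "batch_size": 32})

# Made-up validation accuracies, logged once per evaluation step.
for step, acc in enumerate([0.71, 0.78, 0.80]):
    run_a.log_metric("val_accuracy", acc, step)
for step, acc in enumerate([0.65, 0.79, 0.84]):
    run_b.log_metric("val_accuracy", acc, step)

winner = best_run([run_a, run_b], "val_accuracy")
print(winner.run_id, winner.params["learning_rate"])
```

Keeping the full per-step history, rather than only the final value, is what lets a later comparison ask "which run peaked highest?" without rerunning anything.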

Once a model passes evaluation, a data scientist registers it in the Model Registry. The registry assigns a version number; the new version starts with no stage assigned (None) and is then moved to Staging for review. A reviewer, often an ML engineer or team lead, validates the model against a test suite and, if it passes, transitions the version to Production. This staged approval creates a traceable handoff between model development and deployment, which is the core mechanism the parent article describes when covering how model registries govern what reaches production.

Pro Tip: Register every model that clears your baseline threshold, not just the one you plan to ship. The registry’s version history becomes your rollback menu — if the Production model degrades, you can transition the previous version back without retraining.
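The register, promote, and roll back sequence can be illustrated with a small in-memory stand-in for a registry. This is a sketch under stated assumptions, not MLflow's API: ToyRegistry, its method names, the run IDs, and the "reviewer" user are hypothetical, new versions start without a stage, and the choice to archive the current Production version automatically on promotion is a simplification made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    version: int
    run_id: str          # link back to the training run that produced it
    stage: str = "None"  # a freshly registered version has no stage yet


class ToyRegistry:
    """In-memory stand-in for a model registry (not MLflow's API)."""

    def __init__(self) -> None:
        self.versions = []   # ModelVersion objects, index = version - 1
        self.audit_log = []  # (timestamp, user, version, old_stage, new_stage)

    def register(self, run_id: str) -> ModelVersion:
        """Create the next version number and link it to its training run."""
        mv = ModelVersion(version=len(self.versions) + 1, run_id=run_id)
        self.versions.append(mv)
        return mv

    def transition(self, version: int, stage: str, user: str) -> None:
        """Move a version to a new stage and record who did it and when."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self.versions[version - 1]
        if stage == "Production":
            # Simplification: keep a single live version by archiving
            # whatever else is currently in Production.
            for mv in self.versions:
                if mv.stage == "Production" and mv is not target:
                    self._record(mv, "Archived", user)
        self._record(target, stage, user)

    def _record(self, mv: ModelVersion, stage: str, user: str) -> None:
        now = datetime.now(timezone.utc)
        self.audit_log.append((now, user, mv.version, mv.stage, stage))
        mv.stage = stage

    def in_stage(self, stage: str) -> list:
        """Version numbers currently in the given stage."""
        return [mv.version for mv in self.versions if mv.stage == stage]


registry = ToyRegistry()
registry.register("run-a")  # becomes version 1
registry.register("run-b")  # becomes version 2

registry.transition(1, "Staging", user="reviewer")
registry.transition(1, "Production", user="reviewer")
registry.transition(2, "Staging", user="reviewer")
registry.transition(2, "Production", user="reviewer")  # version 1 is archived

# Rollback: version 2 degrades, so version 1 is promoted again, no retraining.
registry.transition(1, "Production", user="reviewer")
print(registry.in_stage("Production"), registry.in_stage("Archived"))
```

Because every version keeps its run_id and every stage change lands in the audit log, the rollback is a metadata change rather than a new training job.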

When to Use / When Not

Scenario | Use | Avoid
Comparing many experiment runs with different hyperparameters | ✓ |
Solo project with one training script and no team handoffs | | ✓
Managing the model handoff from data science to ML engineering | ✓ |
Your team uses a managed platform (Vertex AI, SageMaker) with a built-in registry | | ✓
Reproducing a model trained months ago for audit or debugging | ✓ |
Real-time model serving with traffic splitting between versions | | ✓

Common Misconception

Myth: MLflow stores and serves your models. Reality: MLflow records where models are stored, pointing to an artifact store such as an S3 bucket or a shared file system. It also provides deployment API wrappers and a basic scoring server (the mlflow models serve command) that is convenient for local testing, but the production serving infrastructure that processes prediction requests runs separately. MLflow tells you which version of a model to use and where to find it; it is not the production inference endpoint.
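The separation of roles can be made concrete with a toy sketch: the registry holds only a location string, the artifact store holds the saved model, and a separate serving component loads and runs it. Everything here is hypothetical and simplified for illustration, including the bucket URIs, the churn-model name, the thresholds, and the ServingProcess class; none of it is MLflow's API.

```python
from typing import Callable, Dict, Tuple

# Artifact store stand-in: URI -> saved model (here just a Python callable).
ARTIFACT_STORE: Dict[str, Callable[[float], int]] = {
    "s3://example-bucket/churn/1": lambda score: int(score > 0.5),
    "s3://example-bucket/churn/2": lambda score: int(score > 0.7),
}

# Registry stand-in: (model name, version) -> artifact URI. Metadata only.
REGISTRY: Dict[Tuple[str, int], str] = {
    ("churn-model", 1): "s3://example-bucket/churn/1",
    ("churn-model", 2): "s3://example-bucket/churn/2",
}


def resolve_uri(name: str, version: int) -> str:
    """Registry role: answer 'where is this version stored?'."""
    return REGISTRY[(name, version)]


class ServingProcess:
    """Serving role: a separate component that loads a model and runs it."""

    def __init__(self, artifact_uri: str) -> None:
        # Fetch the saved model from the artifact store, not from the registry.
        self.model = ARTIFACT_STORE[artifact_uri]

    def predict(self, score: float) -> int:
        return self.model(score)


uri = resolve_uri("churn-model", 2)
server = ServingProcess(uri)
print(uri, server.predict(0.6))
```

The registry never touches a prediction request in this sketch; swapping the serving component for a different one would leave the registry and artifact store unchanged.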

One Sentence to Remember

MLflow is the audit trail that connects a training experiment to its production model version, so when something breaks in production you can trace it back to the exact run, parameters, and data that produced it.

FAQ

Q: Is MLflow only useful for large teams? A: No. Solo data scientists use MLflow Tracking to compare experiments and avoid losing results to notebook clutter. The Registry becomes more valuable as team size grows and handoffs between roles require traceability.

Q: Do I need Databricks to run MLflow? A: No. MLflow is fully open source and runs on any server or locally on a laptop. Databricks offers a managed hosting service, but the open-source version has no dependency on Databricks infrastructure.

Q: How does MLflow’s Model Registry differ from other model registries? A: MLflow’s Registry is one implementation of the model registry concept. Other platforms — Vertex AI Model Registry, SageMaker Model Registry, Hugging Face Model Hub — offer similar versioning and staging workflows as managed cloud services rather than self-hosted open-source tooling.

Expert Takes

MLflow’s core value is reproducibility. Each experiment run stores the code version and parameters that produced a model and, when the team logs it, a reference to the training data, not just the final metrics. This means any model artifact can be traced back to its origin conditions, which is the minimum requirement for making ML experiments falsifiable. Without that trace, comparing two models means comparing outcomes without knowing whether the conditions were actually different.

For a specification-driven workflow, MLflow’s Model Registry is where contracts live. Each model version carries metadata — the run that produced it, the metrics that qualified it, the approval stage it reached. When you update context for an LLM-powered system and the underlying model version changes, the registry lets you trace which evaluation run justified that change. No registry, no audit trail; no audit trail, no confidence in what shipped.

Most teams still hand off models via Slack and shared drives. MLflow changes that calculus: the approval workflow forces a human gate between Staging and Production, the tracking server surfaces which experiments ran before a decision, and the artifact URI makes “which version is live?” a query, not a conversation. Teams that skip this pay in lost hours when something goes wrong in production.

MLflow records who approved a model version and when — but not why the criteria were good enough. An approval entry in the registry is only as meaningful as the standards used to grant it. If teams promote models without documented evaluation thresholds, the audit trail becomes a record of actions without a record of reasoning. The metadata is there; the accountability is not guaranteed by the tool alone.