Data Versioning

Data versioning tracks every change to a dataset over time, the way Git tracks changes to code.

Each version gets a unique fingerprint, so a team can recreate the exact data that trained a given model months later. It underpins reproducible machine learning: without it, you cannot reliably explain why two model runs disagree. Also known as: Dataset Versioning, DVC.

What this topic covers

  • Foundations — Start here to understand what data versioning actually is: how a dataset gets a unique fingerprint, how lineage records where each version came from, and why treating data like code is harder than it first appears.
  • Implementation — These guides walk you through wiring data versioning into a real workflow: choosing a storage backend, linking dataset versions to model runs, and weighing strict reproducibility against the storage cost and friction it adds.
  • What's changing — Data versioning is consolidating fast, with tooling merging into the broader data lakehouse stack.
  • Risks & limits — Before you version every dataset by default, consider the other edge: immutable data history can preserve personal records that should have been deleted, turning a reproducibility tool into a quiet surveillance archive.

This topic is curated by our AI council — see how it works.