lakeFS
Also known as: lake FS, Treeverse lakeFS, data lake version control
- lakeFS is an open-source data version control system that layers Git-style branching, committing, and merging onto object storage such as Amazon S3, Azure Blob, or Google Cloud Storage, making data lakes reproducible and rollback-safe.
lakeFS is an open-source tool that brings Git-style version control to data lakes, letting teams branch, commit, and roll back datasets stored in object storage like Amazon S3 without copying the data.
What It Is
When a machine learning model that worked last month suddenly performs worse after retraining, the first question is always the same: what changed in the data? For code, Git answers that instantly. For the datasets sitting in cloud storage, the answer used to be a folder named final_v2_REAL. lakeFS exists to close that gap. It gives the files in a data lake the same branch-commit-merge workflow developers already use for source code, so a team can pin an exact dataset version, reproduce a past result, or undo a bad write. In short, dataset changes are treated like Git commits.
lakeFS sits between your applications and the storage bucket. Instead of reading and writing objects directly, tools point at a lakeFS repository, which records every change as an immutable commit. The key move is zero-copy branching: creating a branch of a multi-terabyte dataset is a metadata operation, not a physical copy, so it is near-instant and adds almost no storage cost. According to lakeFS Docs, this versioning model relies on zero-copy branches, immutable commits, atomic merges, and metadata stored as SSTables, the same sorted-file structure databases use for fast lookups.
Three operations carry most of the value. A commit captures an immutable snapshot of the entire dataset at a point in time. A branch creates an isolated view of the data (no objects are physically copied) where you can experiment or stage new data without touching production. A merge atomically promotes a validated branch back to the main line, so downstream consumers never see a half-finished update. lakeFS is built by Treeverse and released under the Apache 2.0 license, according to lakeFS’s GitHub repository.
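To make the three operations concrete, here is a minimal in-memory sketch. It is a toy model, not lakeFS code or its API: the `ToyRepo` class, its method names, and the fast-forward-only merge are all assumptions made for illustration. What it does show is the shape of the idea: a branch is a pointer to a commit, a commit is a frozen mapping of paths to stored objects, and a merge is a single pointer update.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """Immutable snapshot: sorted (path, object_key) pairs plus a parent id."""
    id: int
    parent: Optional[int]
    tree: Tuple[Tuple[str, str], ...]


class ToyRepo:
    """Hypothetical stand-in for a versioned repository (not the lakeFS API)."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}          # stand-in for the bucket
        self.commits: Dict[int, Commit] = {0: Commit(0, None, ())}
        self.branches: Dict[str, int] = {"main": 0}  # branch name -> commit id
        self.staged: Dict[str, Dict[str, str]] = {"main": {}}

    def branch(self, name: str, source: str) -> None:
        # Zero-copy: only a pointer is recorded, no object is duplicated.
        self.branches[name] = self.branches[source]
        self.staged[name] = {}

    def write(self, branch: str, path: str, data: bytes) -> None:
        # New data is stored once and staged on this branch only.
        key = f"obj-{len(self.objects)}"
        self.objects[key] = data
        self.staged[branch][path] = key

    def commit(self, branch: str) -> int:
        # Build a new frozen snapshot from the parent tree plus staged changes.
        parent = self.commits[self.branches[branch]]
        tree = dict(parent.tree)
        tree.update(self.staged[branch])
        new = Commit(len(self.commits), parent.id, tuple(sorted(tree.items())))
        self.commits[new.id] = new
        self.branches[branch] = new.id
        self.staged[branch] = {}
        return new.id

    def merge(self, source: str, dest: str) -> None:
        # Toy fast-forward merge: dest's head must be an ancestor of source's.
        dest_head = self.branches[dest]
        cursor: Optional[int] = self.branches[source]
        while cursor is not None and cursor != dest_head:
            cursor = self.commits[cursor].parent
        if cursor is None:
            raise ValueError("toy model only supports fast-forward merges")
        # Readers of dest switch to the new snapshot in one pointer update.
        self.branches[dest] = self.branches[source]

    def read(self, branch: str, path: str) -> bytes:
        tree = dict(self.commits[self.branches[branch]].tree)
        return self.objects[tree[path]]
```

A short usage pattern: write and commit on `main`, call `branch("experiment", "main")`, write and commit on `experiment`, then `merge("experiment", "main")`. Until the merge, `read("main", ...)` keeps returning the old bytes.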
How It’s Used in Practice
The most common reason teams reach for lakeFS is reproducible machine learning. A model is only as auditable as the data it trained on, so teams commit the exact dataset used for each training run and tag it alongside the model. Months later, when a regulator or a curious colleague asks why the model made a particular decision, the team can check out that commit and rebuild the exact data the model saw, with no guesswork and no missing files.
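One way to picture "tagging the dataset alongside the model" is a small run manifest that records the code commit and the data commit together. The sketch below is an assumption-laden illustration: `RunManifest` and its field names are hypothetical, and the `lakefs://<repository>/<ref>/<path>` address shape is used here as it is commonly written for lakeFS URIs; check the lakeFS documentation before relying on it.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record pinning one training run to exact code and data."""
    code_commit: str   # Git commit of the training code
    data_repo: str     # name of the data repository (assumed)
    data_commit: str   # commit id of the dataset snapshot
    data_path: str     # path prefix inside the repository

    def data_uri(self) -> str:
        # Address the data by commit id, not by branch name, so the
        # reference cannot drift when the branch moves forward.
        return f"lakefs://{self.data_repo}/{self.data_commit}/{self.data_path}"

    def to_json(self) -> str:
        # Serialize with sorted keys so the manifest itself diffs cleanly.
        return json.dumps(asdict(self), sort_keys=True)
```

Storing this JSON next to the model artifact means a later audit needs only the manifest to know which data to check out.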
The second common pattern is safe ingestion. New raw data lands on a branch first, where validation jobs check it for schema drift, missing values, or corruption. Only after those checks pass does the branch merge into main, so a bad batch never silently poisons every downstream dashboard and model. If something does slip through, a single revert rolls the whole lake back to the last good commit.
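The branch-validate-merge gate can be sketched in a few lines. Everything named here is an assumption for illustration: the three-column schema, the `validate_batch` checks, and the `merge` callback, which stands in for whatever call actually promotes the ingestion branch in a given setup.

```python
import csv
import io
from typing import Callable, List

# Hypothetical schema for the incoming batch.
EXPECTED_COLUMNS = ["user_id", "event", "ts"]


def validate_batch(raw_csv: str) -> List[str]:
    """Return a list of problems; an empty list means the batch passed."""
    problems: List[str] = []
    reader = csv.DictReader(io.StringIO(raw_csv))
    # Schema drift: the header must match the expected columns exactly.
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"schema drift: got {reader.fieldnames}")
        return problems
    # Missing values: every expected column must be non-empty in every row.
    for line_no, row in enumerate(reader, start=2):
        missing = [c for c in EXPECTED_COLUMNS if not (row[c] or "").strip()]
        if missing:
            problems.append(f"line {line_no}: missing {missing}")
    return problems


def promote_if_valid(raw_csv: str, merge: Callable[[], None]) -> bool:
    """Run the checks and call merge() only when they all pass."""
    if validate_batch(raw_csv):
        return False  # the branch stays unmerged; main is untouched
    merge()
    return True
```

The design point is that a failed check is a no-op for production: the bad batch simply never leaves its branch.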
Pro Tip: Don’t try to version every scratch file. Point lakeFS at the datasets that actually feed production models and dashboards, and let genuinely temporary working files live outside it. Versioning everything turns your commit history into noise and buries the snapshots that matter.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Reproducing the exact dataset behind a past model | ✅ | |
| Petabyte-scale data lakes on S3, Azure, or GCS | ✅ | |
| Validating new data on a branch before it reaches production | ✅ | |
| A handful of small CSVs that Git or Git LFS already handles | | ❌ |
| Row-level ACID transactions inside one table format | | ❌ |
Common Misconception
Myth: lakeFS makes a full copy of your data every time you branch, so it doubles storage costs. Reality: Branching is a metadata-only operation. A new branch points at the same underlying objects until you actually change something, and only the changed files get written. According to lakeFS Docs, this zero-copy model is what makes branching a multi-terabyte dataset near-instant instead of an overnight copy job.
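The storage arithmetic behind that claim can be checked with a toy copy-on-write model. This is not how lakeFS is implemented internally; the `blobs` dictionary, the `put` helper, and the file names are all hypothetical. It only shows why duplicating a path-to-object mapping costs no object storage, while changing one file costs exactly one new object.

```python
from typing import Dict

blobs: Dict[str, bytes] = {}  # stand-in for objects in the bucket


def put(data: bytes) -> str:
    """Store data as a new object and return its key."""
    key = f"blob-{len(blobs)}"
    blobs[key] = data
    return key


def stored_bytes() -> int:
    """Total bytes physically held in the stand-in bucket."""
    return sum(len(b) for b in blobs.values())


# "main" holds ten files of 1000 bytes each.
main = {f"part-{i}.parquet": put(bytes([i]) * 1000) for i in range(10)}
before = stored_bytes()

# Branching copies the path -> key mapping only; no object is written.
experiment = dict(main)
after_branch = stored_bytes()

# Changing one file on the branch writes exactly one new object.
experiment["part-3.parquet"] = put(b"x" * 1000)
after_change = stored_bytes()

print(before, after_branch, after_change)
```

Storage grows only by the size of the changed file, and `main` still points at the original object for `part-3.parquet`.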
One Sentence to Remember
Think of lakeFS as Git for the data lake: it lets teams branch, commit, and roll back datasets in object storage without copying terabytes, so “which data made this model?” always has a precise, reproducible answer. If your AI work depends on knowing exactly what your models learned from, that auditability is the entire point.
FAQ
Q: Is lakeFS the same as Git for data? A: It borrows Git’s branch, commit, and merge model but is built for object storage, not source files. lakeFS versions terabyte-scale datasets in place, where plain Git would choke on the file size.
Q: Does lakeFS work with my existing S3 setup? A: Yes. According to lakeFS Docs, it runs on top of Amazon S3, Azure Blob Storage, and Google Cloud Storage, so your data stays in the bucket you already use while lakeFS adds the version layer.
Q: How is lakeFS different from Delta Lake or Apache Iceberg? A: Delta Lake and Iceberg version individual tables; lakeFS versions the entire lake across many files and formats. Teams often run lakeFS alongside table formats rather than choosing between them.
Sources
- lakeFS’s GitHub repository: treeverse/lakeFS — Data version control for your data lake - Source code, license, and release history for the lakeFS core server.
- lakeFS Docs: Versioning Internals — lakeFS Documentation - How zero-copy branches, immutable commits, and atomic merges are implemented.
Expert Takes
What lakeFS really provides is content-addressable immutability. Each commit is a snapshot identified by the state of its contents, not by a label someone can quietly overwrite. That property is what makes reproducibility possible: if the identifier matches, the data is provably identical. Git proved this idea for text. lakeFS extends the same guarantee to datasets far too large to fit on one machine.
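The "if the identifier matches, the data is identical" property is easy to demonstrate with a generic content-addressed identifier. The sketch below assumes SHA-256 over a sorted manifest of paths and file digests; that choice, and the `snapshot_id` function itself, are illustrative assumptions and not a description of how lakeFS computes its commit identifiers.

```python
import hashlib
from typing import Dict


def snapshot_id(files: Dict[str, bytes]) -> str:
    """Derive an identifier purely from the paths and contents of a snapshot."""
    h = hashlib.sha256()
    # Sort paths so the id does not depend on insertion order.
    for path in sorted(files):
        file_digest = hashlib.sha256(files[path]).hexdigest()
        h.update(f"{path}\x00{file_digest}\n".encode("utf-8"))
    return h.hexdigest()
```

Two snapshots with the same files yield the same identifier regardless of the order they were listed in, and changing a single byte in any file yields a different one.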
Treat the dataset as a deployable artifact, not a loose pile of files. lakeFS lets you pin the exact data a pipeline consumed the same way you pin a dependency version, so a run is defined by its code commit and its data commit together. Add a branch-validate-merge step to ingestion and bad data fails the check instead of reaching production. Reproducibility becomes a default, not a heroic recovery effort.
Data versioning is quietly becoming table stakes. As AI moves into regulated industries, “we think this is the data we trained on” stops being an acceptable answer. The tooling is consolidating too: the team behind lakeFS now stewards the most popular lightweight data-versioning project as well, signaling that reproducible data is no longer a niche concern but core infrastructure every serious AI team will be expected to have.
Versioning makes data auditable, but auditable is not the same as fair. A perfectly reproducible pipeline can faithfully reproduce a biased dataset, and a clean commit history can lend a false sense of rigor to data that was flawed from the start. The harder question lakeFS does not answer: who decides what belongs in the dataset, and who reviews that decision before it becomes the permanent, version-locked record everyone trusts?