Git LFS

Also known as: Git Large File Storage, LFS, git-lfs

Git LFS (Large File Storage) is an open-source Git extension that keeps large files like datasets, models, and media out of the main repository. It commits a tiny pointer file in place of each large file and stores the actual content on a separate server, downloaded on checkout.

What It Is

Git repositories are built for source code: thousands of small text files where Git can track exactly which lines changed between versions. That model breaks down once you add a multi-gigabyte training dataset or a saved model file. Git keeps every version of every file in its history, and large binaries rarely compress or delta well, so committing one a few times can swell the repository to several times the size of the file itself, and everyone who clones it inherits that weight. Git LFS exists to keep large files versioned next to code without paying that price.

The mechanism is substitution. When you tell Git LFS to track a file pattern, such as every .csv or .bin, it intercepts those files before they enter the repository. Instead of committing the file’s bytes, it commits a small text pointer. According to git-lfs.com, this pointer is under 1 KB and carries a SHA-256 object ID, a unique fingerprint of the file’s contents. The real bytes live on a separate LFS server. Think of the pointer as a coat-check ticket: you keep the small stub while the bulky item waits in the back room until you present it. When someone checks out the branch, Git LFS reads the pointer and downloads the matching content on demand.
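The substitution can be sketched in a few lines of Python. This is an illustration only, not how the Git LFS client is implemented: the function name `make_pointer` is hypothetical, and the three-line layout (version, oid, size) follows the published Git LFS pointer specification.

```python
import hashlib


def make_pointer(content: bytes) -> str:
    """Build the text of a Git LFS-style pointer for the given file content."""
    # The object ID is the SHA-256 hash of the file's bytes, written in hex.
    oid = hashlib.sha256(content).hexdigest()
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


if __name__ == "__main__":
    # Stand-in for a large file: 10 MB of zero bytes.
    big_file = bytes(10 * 1024 * 1024)
    pointer = make_pointer(big_file)
    print(pointer)
    print(f"pointer: {len(pointer.encode())} bytes, file: {len(big_file)} bytes")
```

However large the input, the pointer stays the same small size, because it records only the hash and the byte count.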

That design fixes the clone-size problem but inherits one of Git’s habits. Git LFS deduplicates per whole object: according to Perforce, changing a single row in a large file stores a complete new copy on the server rather than just the difference. For a dataset that gets edited often, storage still grows version by version, the same scaling limit that pushes teams toward purpose-built data versioning tools. Hosting platforms add their own ceilings. According to GitHub Docs, GitHub caps the size of each LFS file by plan, up to 5 GB at the top tier, with storage and bandwidth metered separately from the repository itself.
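The whole-object behavior can be illustrated with a toy content-addressed store. This is a sketch under simplifying assumptions: `ObjectStore` and `make_dataset` are hypothetical names, and the in-memory dictionary merely stands in for an LFS server.

```python
import hashlib
from typing import Dict, Optional


class ObjectStore:
    """Toy content-addressed store: one full object per distinct content hash."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        oid = hashlib.sha256(content).hexdigest()
        # Identical content maps to the same key, so it is stored only once.
        self.objects[oid] = content
        return oid

    def total_bytes(self) -> int:
        return sum(len(blob) for blob in self.objects.values())


def make_dataset(rows: int, changed_row: Optional[int] = None) -> bytes:
    """Build a small CSV-like dataset, optionally with one edited row."""
    lines = [f"{i},value\n" for i in range(rows)]
    if changed_row is not None:
        lines[changed_row] = f"{changed_row},edited\n"
    return "".join(lines).encode()


if __name__ == "__main__":
    store = ObjectStore()
    v1 = make_dataset(100_000)
    v2 = make_dataset(100_000, changed_row=42)  # exactly one row differs

    store.put(v1)
    store.put(v1)  # same bytes again: deduplicated
    store.put(v2)  # one-row edit: a complete second object is stored

    print(len(store.objects), "objects,", store.total_bytes(), "bytes stored")
```

Re-adding identical bytes costs nothing, but the one-row edit roughly doubles what the store holds, which is the growth pattern described above.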

How It’s Used in Practice

The most common place people meet Git LFS is a machine learning project living in a Git repository. A team wants their datasets and trained model files versioned next to the training code, so any commit captures both the code that produced a model and the model itself. With plain Git, the repository could become painfully slow to clone within weeks. With Git LFS, the team runs a one-time setup, declares which file types to track, and from then on those large files are handled automatically: committed as pointers, stored on the LFS server, pulled only when needed.
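As a rough sketch of what declaring file types produces: the usual commands are `git lfs install` for the one-time setup and `git lfs track "<pattern>"` for each file type, and tracking writes a rule into the repository's `.gitattributes` file. The Python below only generates those rules as text for illustration; the helper name `lfs_attribute_lines` and the example patterns are assumptions.

```python
from typing import List


def lfs_attribute_lines(patterns: List[str]) -> List[str]:
    """Return .gitattributes rules that route matching files through Git LFS."""
    # Each rule has the same shape that `git lfs track "<pattern>"` writes.
    return [f"{pattern} filter=lfs diff=lfs merge=lfs -text" for pattern in patterns]


if __name__ == "__main__":
    for line in lfs_attribute_lines(["*.csv", "*.bin"]):
        print(line)
```

Because `.gitattributes` is itself committed, every contributor who clones the repository picks up the same tracking rules.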

The visible payoff is a fast clone. A new team member gets the code plus lightweight pointers immediately, then downloads only the large files their task actually needs instead of the entire history of every dataset version.

Pro Tip: Set up Git LFS tracking before you commit any large files, not after. If a big binary already landed in your Git history, adding LFS later won’t shrink the repository: the bytes are baked into past commits, and removing them means rewriting history for everyone. Decide what counts as “large” on day one.
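One way to make the "what counts as large" decision concrete is to scan the working tree before the first commit. The script below is a minimal sketch: the function name `find_large_files` and the 50 MB cutoff are illustrative assumptions, not a Git LFS recommendation.

```python
import os
from typing import Iterator, Tuple


def find_large_files(root: str, threshold_bytes: int) -> Iterator[Tuple[str, int]]:
    """Yield (path, size) for regular files under root at or above the threshold."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object database.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and other non-regular entries
            size = os.path.getsize(path)
            if size >= threshold_bytes:
                yield path, size


if __name__ == "__main__":
    # Hypothetical cutoff for "large": 50 MB. Pick a value that suits the project.
    for path, size in find_large_files(".", 50 * 1024 * 1024):
        print(f"{size / 1_048_576:8.1f} MB  {path}")
```

The file extensions it reports are natural candidates for tracking patterns.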

When to Use / When Not

Scenario | Use | Avoid
Versioning datasets or model weights alongside training code | ✓ |
A dataset that is re-edited daily | | ✓
Storing large media assets (video, design files) in a code repo | ✓ |
Petabyte-scale data or strict data-lineage and time-travel needs | | ✓
A small team that wants large files tracked with familiar Git commands | ✓ |
Frequent partial edits where you need diff-level deduplication | | ✓

Common Misconception

Myth: Git LFS shrinks your repository or compresses large files. Reality: Git LFS doesn’t compress anything or reduce total storage. It moves large files out of the cloneable Git history and onto a separate server. The full bytes still exist, and because each change to a tracked file stores a complete new copy, total storage can grow faster than people expect.

One Sentence to Remember

Git LFS is the simplest way to keep large files versioned inside a Git workflow, but it relocates the storage problem rather than solving it: reach for it when your large files are relatively stable, and switch to dedicated data-versioning tools once they change constantly or outgrow a Git host.

FAQ

Q: Does Git LFS reduce the size of my repository? A: It reduces what others have to clone, not your total storage. Large files move to a separate server, but every version is still kept in full, so overall storage can keep growing.

Q: Is Git LFS free to use? A: The Git LFS software is open-source and free. Hosting platforms like GitHub include a free storage and bandwidth allowance, then charge for usage beyond it, so heavy use can incur costs.

Q: When should I use a tool like lakeFS or Delta Lake instead of Git LFS? A: Choose those when data changes constantly, grows past what a Git host allows, or you need branch-level data management and time travel. Git LFS suits relatively stable large files tracked beside code.

Expert Takes

Git LFS doesn’t make large files small. It changes what the version-control system tracks. Git was designed to reason about text line by line, which is meaningless for a compressed binary. By committing a content-addressed pointer instead of the bytes, Git LFS lets the repository stay a graph of small, comparable objects while the heavy content is handled by reference. The model is indirection, not compression.

The failure I see most is teams bolting Git LFS on after the large files already shipped into history. By then the repository is heavy, and the fix means rewriting history for everyone. Treat what counts as large as a spec decision you make at project setup, write the tracking rules into the repository config, and every contributor inherits the same behavior automatically. Configure the boundary once, up front.

Git LFS was the pragmatic answer when teams wanted their models and datasets to live next to code without learning new tooling. It still wins on familiarity. But the moment data becomes the product rather than an attachment to it, the market moves toward dedicated data-versioning platforms built for that scale. Git LFS is the on-ramp, not the destination, for serious data teams.

There’s a quieter cost to making large files this easy to commit. When versioning a dataset takes one command, every revision gets kept by default, and few teams ever ask which versions should have existed at all. Who is accountable for a training set whose full lineage nobody can reconstruct, even though every byte was dutifully stored? Convenience and genuine traceability are not the same thing.