SAM 2
- SAM 2 (Segment Anything Model 2) is Meta’s open-weight foundation model for promptable image and video segmentation. Released under Apache 2.0, it accepts clicks, boxes, or masks as prompts and uses a memory module to track objects across video frames, even through occlusions.
What It Is
SAM 2 solves a problem that used to require a custom-trained model for every object class: telling a computer exactly which pixels belong to the thing you care about. Whether you want to remove a background, mask a face, or follow a specific car through a video clip, SAM 2 produces a precise outline from a single click, box, or rough mask — no labeled training data, no per-class fine-tuning, and no separate code paths for stills versus video.
The architecture has three parts: an image encoder that turns each frame into a feature map, a prompt encoder that translates user input into a query, and a mask decoder that predicts the segmentation. According to the SAM 2 paper, the model adds a per-session memory module that stores feature embeddings from previous frames, so when an object disappears behind another and reappears, SAM 2 still knows it is the same instance.
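To make the prompt-to-mask flow concrete, here is a minimal single-click image segmentation sketch based on the predictor API in Meta's facebookresearch/sam2 GitHub repository. The checkpoint and config paths, image file, and click coordinates are placeholders; verify names against the repo's current release.

```python
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Placeholder paths -- checkpoints are downloaded from the sam2 GitHub repo.
CHECKPOINT = "checkpoints/sam2.1_hiera_base_plus.pt"
CONFIG = "configs/sam2.1/sam2.1_hiera_b+.yaml"

predictor = SAM2ImagePredictor(build_sam2(CONFIG, CHECKPOINT))
image = np.array(Image.open("product_shot.jpg").convert("RGB"))

with torch.inference_mode():
    predictor.set_image(image)                    # image encoder: frame -> feature map
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[480, 320]]),      # prompt encoder: one click on the subject
        point_labels=np.array([1]),               # 1 = foreground, 0 = background
        multimask_output=True,                    # mask decoder: several candidate masks
    )

best_mask = masks[np.argmax(scores)]              # keep the highest-scoring candidate
```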
According to Meta AI Research, SAM 2 ships in four sizes — tiny, small, base plus, and large — and runs at roughly 44 FPS on a single GPU, making it fast enough for interactive editing tools and live video pipelines. The full model and weights are available on GitHub under Apache 2.0, which means commercial products can ship it without licensing fees.
For someone evaluating segmentation options, the practical takeaway is unification. Before SAM 2, image cutouts and video object tracking were separate problems handled by separate libraries. SAM 2 collapses them into one model with one API, so a tool that does headshot cutouts can use the same backbone to track a presenter in a webinar recording.
How It’s Used in Practice
SAM 2 sits inside the segmentation step of most AI-powered background removal and object isolation pipelines. In a typical product workflow — say a photo editor that lets the user remove the background of a product shot — the app sends the image and a single click on the subject to SAM 2, gets back a high-precision mask, and uses that mask to compose a transparent PNG or replace the background. According to the Meta AI blog, this exact pattern powers the Cutouts feature inside Meta’s Instagram Edits app.
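The compositing step after the model call is plain array work. A sketch, assuming `image` and `best_mask` from the predictor sketch above (the helper name is our own, not part of the sam2 library):

```python
import numpy as np
from PIL import Image

def mask_to_transparent_png(image_rgb: np.ndarray, mask: np.ndarray, out_path: str) -> None:
    """Use a boolean segmentation mask as the alpha channel of an RGBA PNG."""
    alpha = mask.astype(np.uint8) * 255           # foreground opaque, background transparent
    rgba = np.dstack([image_rgb, alpha])
    Image.fromarray(rgba, mode="RGBA").save(out_path)

# Using the outputs of the earlier predictor sketch:
# mask_to_transparent_png(image, best_mask.astype(bool), "cutout.png")
```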
The same model handles video. Instead of clicking once per frame, the user clicks once on frame zero and the memory module propagates the mask across the rest of the clip. That makes SAM 2 the segmentation backbone for tools that previously needed expensive per-frame manual work, like rotoscoping (drawing a mask around a subject frame by frame) in video editors or tracking a single dancer in a crowd.
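A sketch of that one-click-then-propagate pattern, based on the video predictor in the same repository. The repo's notebooks feed `init_state` a directory of JPEG frames; the paths, object id, and click coordinates below are placeholders.

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

CHECKPOINT = "checkpoints/sam2.1_hiera_base_plus.pt"  # placeholder paths
CONFIG = "configs/sam2.1/sam2.1_hiera_b+.yaml"

predictor = build_sam2_video_predictor(CONFIG, CHECKPOINT)

with torch.inference_mode():
    state = predictor.init_state(video_path="clip_frames/")  # one session, one memory bank

    # One click on frame zero registers the object with the memory module.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=[[480, 320]],
        labels=[1],                                   # 1 = foreground click
    )

    # The memory module propagates the mask across the remaining frames.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        frame_masks = (mask_logits > 0.0).cpu().numpy()  # threshold logits to binary masks
```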
Pro Tip: If you only need clean cutouts on still photos and your subjects are mostly people, products, or animals, the largest checkpoint is overkill — start with base plus or even tiny. The size jump matters most when you need to track small or visually similar objects across long video clips, which is what SAM 2.1 specifically improved.
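If you want that variant choice explicit in code, the checkpoint and config filenames below follow the SAM 2.1 naming convention in the sam2 repo; treat them as a sketch and verify against the current release.

```python
# (checkpoint, config) pairs as named in the sam2 GitHub repo's SAM 2.1 release.
SAM2_VARIANTS = {
    "tiny":      ("checkpoints/sam2.1_hiera_tiny.pt",      "configs/sam2.1/sam2.1_hiera_t.yaml"),
    "small":     ("checkpoints/sam2.1_hiera_small.pt",     "configs/sam2.1/sam2.1_hiera_s.yaml"),
    "base_plus": ("checkpoints/sam2.1_hiera_base_plus.pt", "configs/sam2.1/sam2.1_hiera_b+.yaml"),
    "large":     ("checkpoints/sam2.1_hiera_large.pt",     "configs/sam2.1/sam2.1_hiera_l.yaml"),
}

checkpoint, config = SAM2_VARIANTS["base_plus"]  # start small; move up only if tracking degrades
```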
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Promptable cutouts where the user clicks the subject | ✅ | |
| Single-shot, fully automated background removal at API scale | | ❌ |
| Tracking one object across frames of a video | ✅ | |
| Hair-level alpha matting on portraits without post-processing | | ❌ |
| Self-hosting a segmentation model under a permissive license | ✅ | |
| Edge deployment on phones with tight memory budgets | | ❌ |
Common Misconception
Myth: SAM 2 is just the original SAM with video bolted on. Reality: According to the SAM 2 paper, SAM 2 is a unified model trained on both image and video data, not a video wrapper around an image model. The memory module that tracks objects across frames also helps it produce cleaner masks on still images, because the same architecture learns from richer temporal context.
One Sentence to Remember
SAM 2 is the foundation segmentation layer most modern AI cutout tools sit on top of — fast, open-weight, and the same model whether you point it at a photo or a video.
FAQ
Q: Is SAM 2 free to use in commercial products? A: Yes. According to Meta AI Research, SAM 2 is released under Apache 2.0, which permits commercial use, modification, and redistribution without licensing fees, including inside closed-source applications and paid SaaS products.
Q: Does SAM 2 replace tools like Remove.bg or Photoroom? A: Not directly. SAM 2 is the segmentation model. Background removal services add automatic prompting, alpha matting refinement, and hosted infrastructure on top, which most teams do not want to build themselves.
Q: What is the difference between SAM 2 and SAM 2.1? A: SAM 2.1 is the current production checkpoint, released September 30, 2024. According to Encord, it improves handling of visually similar objects, small objects, and longer occlusions while keeping the same architecture and API as SAM 2.
Sources
- Meta AI Research: Introducing Meta Segment Anything Model 2 (SAM 2) - Official model card with capabilities, variants, license, and inference benchmarks.
- SAM 2 paper: SAM 2: Segment Anything in Images and Videos - Architecture and training details for the unified image/video segmentation model and its memory module.
Expert Takes
SAM 2 is best understood as a foundation model for segmentation, in the same sense that large language models are foundation models for text. The training objective is generic: predict a mask given an image or frame and a prompt. The memory module is what makes video work — it lets attention reach back into earlier frames so the same object stays the same object even when half of it is hidden. No labels per class. Just structure and scale.
For specification-driven workflows, SAM 2 changes the contract for “isolate this thing.” Your spec no longer needs to describe the segmentation algorithm — you just declare the object you want and the prompt format the user will provide. The model handles the rest, identically for stills and video. Pin the checkpoint version in your spec and you get reproducible cutouts; leave it floating and you get drift the day Meta ships an update.
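One way to pin the checkpoint in practice is to record its digest in the spec and fail fast on mismatch. This is a hypothetical pattern, not part of the sam2 library; the digest below is a placeholder you record when writing the spec.

```python
import hashlib
from pathlib import Path

PINNED_CHECKPOINT = "checkpoints/sam2.1_hiera_base_plus.pt"
PINNED_SHA256 = "<digest recorded when the spec was written>"  # placeholder

def verify_checkpoint(path: str, expected_sha256: str) -> None:
    """Refuse to run against any weights other than the ones the spec pins."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if digest != expected_sha256:
        raise RuntimeError(
            f"Checkpoint drift: {path} has sha256 {digest}, spec pins {expected_sha256}"
        )

# verify_checkpoint(PINNED_CHECKPOINT, PINNED_SHA256)
```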
Permissive open-weight licensing is the part that quietly restructured the segmentation market. Closed APIs that charged per cutout suddenly compete against a model any team can self-host. The winners are not the ones with the best mask quality — that ceiling is now shared. The winners are the ones who wrap SAM 2 in workflows the buyer cannot reproduce in a weekend: bulk pipelines, brand-aware refinement, hosted scale, integrations. The model is the floor, not the moat.
A model that can isolate any object from any image or video is a tool, but it is also a building block for less benign systems — automated surveillance crops, identity removal at scale, deepfake compositing pipelines. The license permits these uses; the architecture does not care. Releasing such capability openly is a defensible choice for research, but the responsibility for how it ends up deployed shifts from Meta to whoever ships the next thousand products built on top.