Classifier-Free Guidance
Classifier-Free Guidance (CFG) is a sampling technique that steers diffusion models toward a text prompt by blending two predictions from the same network—one conditioned on the prompt, one unconditioned.
What It Is
Diffusion models like Stable Diffusion and FLUX work by starting with pure noise and gradually denoising it over dozens of steps. The problem: without extra help, the model has no strong pressure to match your text prompt. The output might look like a plausible image, but it drifts toward “any image,” not “the image you described.” Classifier-Free Guidance closes that gap.
Early diffusion papers solved prompt-following with classifier guidance — training a separate image classifier on noisy inputs and using its gradients to nudge generation toward a target class. It worked, but it had a serious cost. Every new label space, every new conditioning modality, meant training a new noise-aware classifier from scratch on corrupted inputs normal classifiers cannot handle. Classifier-Free Guidance replaced that entire parallel track.
The trick is surprisingly simple. Ho & Salimans trained one diffusion network on paired data but randomly dropped the conditioning signal — the text prompt or class label — for a small fraction of training examples. The same network now implicitly knows how to predict noise in two modes: with the prompt active, and with the prompt blanked out. No second model required.
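A minimal sketch of that training-time conditioning dropout, in NumPy rather than a real training framework. The function name, the embedding shapes, and the 10% drop rate are illustrative assumptions, not the paper's exact code:

```python
import numpy as np

def drop_conditioning(text_emb, null_emb, p_drop=0.1, rng=None):
    """Randomly replace some samples' prompt embeddings with a 'null'
    (empty-prompt) embedding, so one network learns both conditional
    and unconditional noise prediction. Illustrative sketch only."""
    if rng is None:
        rng = np.random.default_rng()
    batch = text_emb.shape[0]
    # True => this sample is trained with the prompt blanked out.
    drop = rng.random(batch) < p_drop
    out = text_emb.copy()
    out[drop] = null_emb  # broadcast the null embedding over dropped rows
    return out
```

In real pipelines the null embedding is typically the text encoder's output for the empty string, so "prompt blanked out" still produces a well-formed input to the network.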
At sampling time, both predictions are computed in parallel and combined using a guidance scale, usually labelled w or CFG. According to Ho & Salimans, the final noise estimate is a linear extrapolation between the unconditional and conditional predictions, pushed past the conditional in the direction the prompt points. Higher scale values pull output harder toward whatever the prompt described. Lower values let the model stay closer to its unguided, “anything plausible” behaviour. That single knob is what powers the “Guidance Scale” or “CFG” slider you see in every modern image UI — from AUTOMATIC1111 to ComfyUI to hosted services.
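The blend itself is one line. A sketch using the convention most inference libraries expose, where a scale of 1 recovers the plain conditional prediction (the paper parameterises the scale slightly differently, but the extrapolation is the same):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start from the unconditional noise
    estimate and extrapolate past the conditional one in the direction
    the prompt points. scale=1 is the plain conditional prediction;
    scale>1 pushes harder toward the prompt."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

At scale 0 the prompt is ignored entirely; at typical values like 7, the output is pushed well beyond what the conditional prediction alone would give.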
The downstream effect is that the same architecture stack behind a modern diffusion pipeline — the U-Net or DiT denoiser, the VAE, the scheduler, the text encoder — all inherit prompt-following capability from this one sampling trick. Remove CFG, and modern diffusion UIs become surprisingly bad at doing what you told them.
How It’s Used in Practice
When you open Stable Diffusion WebUI, ComfyUI, or a hosted service like DreamStudio or Leonardo, the CFG scale is usually the second setting after your prompt. It lives right next to the sampler choice and the step count. You type a prompt, pick a value, and every step of the denoising loop silently runs the network twice per image — once with your prompt active and once with it dropped — then blends the two predictions before the next step.
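That twice-per-step evaluation is usually batched: the latents are duplicated, the null and prompt embeddings are stacked, and the network runs once on a doubled batch. A hypothetical sketch in NumPy; `denoiser` stands in for the real U-Net or DiT call, and the shapes are illustrative:

```python
import numpy as np

def cfg_noise_estimate(denoiser, x_t, t, prompt_emb, null_emb, scale):
    """One denoising step's guided noise estimate.
    `denoiser(latents, t, cond) -> noise` is a hypothetical callable;
    real pipelines pass the actual model here."""
    latents = np.concatenate([x_t, x_t], axis=0)           # duplicate latents
    cond = np.concatenate([null_emb, prompt_emb], axis=0)  # [uncond, cond]
    noise = denoiser(latents, t, cond)                     # single forward pass
    eps_uncond, eps_cond = np.split(noise, 2, axis=0)
    # Extrapolate past the conditional prediction toward the prompt.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

This doubled-batch trick is why turning CFG off (scale 1 with the unconditional pass skipped) roughly halves per-step compute in many UIs.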
According to Stability AI, typical guidance scale values for classic Stable Diffusion text-to-image land between 7 and 12, while newer rectified-flow models like SD3 and FLUX.1 work better with lower values around 3.5 to 5, because their base sampling is already more prompt-aligned.
Pro Tip: If your images look oversaturated, have burned-in colours, or the subject looks weirdly exaggerated — your guidance scale is probably too high, not your prompt too vague. Drop the CFG by two notches and regenerate before rewriting the prompt. Most “bad prompt” problems in diffusion UIs are actually CFG problems in disguise.
When to Use / When Not
| Scenario | Use | Avoid |
|---|---|---|
| Detailed text-to-image prompt where you want strict adherence | ✅ | |
| Classic Stable Diffusion or SDXL needing a stronger pull toward the prompt | ✅ | |
| Text-conditioned video, audio, or 3D diffusion pipelines built on the same math | ✅ | |
| Rectified-flow models pushed well past their recommended range — outputs oversaturate and lose diversity | ❌ | |
| Relying on high CFG to rescue a vague prompt — fix the prompt instead | ❌ | |
| Treating CFG as a general quality slider rather than a prompt-adherence slider | ❌ |
Common Misconception
Myth: Higher CFG always means better prompt adherence, so turn it up for sharper results. Reality: Past a model-specific sweet spot, higher CFG doesn’t just fail to help — it actively degrades the image. Colours oversaturate, faces distort, and compositions collapse. Rectified-flow models like FLUX.1 and SD3 break this rule much earlier than SDXL, because their base training is already more prompt-aligned and they need less pushing to follow the prompt.
One Sentence to Remember
Classifier-Free Guidance is the knob that tells your diffusion model how hard to chase your prompt versus how much to trust its own imagination — and the sweet spot lives inside a narrow, model-specific band, not at the top of the slider.
FAQ
Q: What does the CFG scale actually do? A: It controls how strongly the diffusion model follows your prompt. Higher values pull the output toward the prompt, lower values let the model explore freely. The sweet spot is model-specific.
Q: Why did Classifier-Free Guidance replace classifier guidance? A: Classifier guidance required a separate classifier trained on noisy images for every new task. CFG trains one network to predict both conditionally and unconditionally, so no extra classifier is ever needed.
Q: Is CFG still used in the newest diffusion models? A: Yes. Stable Diffusion 3, SD 3.5, and FLUX.1 all ship with CFG, just at lower guidance scales than SDXL, because rectified-flow training produces better prompt alignment from the start.
Sources
- Ho & Salimans: Classifier-Free Diffusion Guidance - Original paper introducing the CFG technique.
- Stability AI: Stable Diffusion 3: Research Paper - How newer rectified-flow models calibrate guidance scales at inference.
Expert Takes
Classifier-Free Guidance is a clean bit of mathematical economy. One network learns two distributions — conditional and unconditional — from the same weights, and inference extrapolates between them. Not a clever trick. An elegant factoring. The insight is realising you don’t need a separate classifier at all if the same model can both see the prompt and not see it. Steering replaced labelling, and an entire research track quietly dissolved.
CFG is what makes your prompt an actual spec rather than a suggestion. Without it, the model wanders. With it, every sampling step re-anchors to the prompt. Treat the guidance scale as a contract-strength setting: low means loose interpretation, high means strict literal follow. Most prompt engineering failures I see in diffusion workflows are really CFG tuning failures — the prompt was fine, the adherence knob was simply in the wrong position.
Every image model shipped in the last few years runs on CFG. That standardisation matters. Product teams building AI image features no longer have to train and ship a separate classifier per use case — they ship one diffusion model and expose a single slider. The switch from classifier guidance to CFG quietly collapsed an entire category of ML infrastructure work. You’re not tuning pipelines anymore. You’re tuning one parameter.
The guidance scale is a steering wheel pointed at the model’s imagination, and nobody agrees on what a sensible grip actually is. High CFG makes the model more obedient to the text, which sounds good — until you notice that “more obedient to text” also means “less able to surprise you with anything you didn’t already ask for.” Who decides the default value shipped in the UI? That’s a creative-freedom question disguised as a hyperparameter.