Evidence Lower Bound

Also known as: ELBO, Variational Lower Bound, Negative Variational Free Energy

Evidence Lower Bound
A mathematical lower bound on the log-likelihood of observed data, used as the training objective in variational autoencoders. ELBO combines a reconstruction term measuring output fidelity with a KL divergence term penalizing deviation from a chosen prior distribution.

The evidence lower bound (ELBO) is the optimization target that variational autoencoders maximize during training, combining reconstruction accuracy with a regularization penalty that keeps the latent space well-organized.

What It Is

When you train a generative model like a variational autoencoder (VAE), you want the model to learn patterns in your data so it can create new, realistic samples. The problem: directly computing how likely the data is under the model — a quantity called “evidence” or marginal likelihood — requires summing over every possible internal representation. For any real dataset, that sum is computationally impossible.

ELBO provides a computable stand-in. It is a value guaranteed to be less than or equal to the true evidence, so maximizing it pushes the model toward better performance without needing the impossible calculation.

Think of ELBO like a two-part score on a student exam. One part measures how accurately the student (the decoder) can recreate the original question from their notes (the latent representation) — this is the reconstruction term. The second part checks whether the note-taking system is organized and general-purpose rather than full of cramming shortcuts — this is the KL divergence term. A good score on both means the model learns meaningful patterns instead of memorizing specific examples.

According to Wikipedia, the bound itself follows from applying Jensen’s inequality to the log-marginal likelihood; rearranging it yields the standard two-term decomposition:

ELBO = E_q[log p(x|z)] - D_KL(q(z|x) || p(z))

The first term rewards the decoder for producing outputs that closely match the original input. The second term penalizes the encoder when its learned distribution q(z|x) drifts too far from the prior p(z), which is typically a standard normal distribution. According to Kingma & Welling (2013), this decomposition makes VAE training tractable — both terms can be optimized simultaneously through gradient descent.
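As a concrete sketch of that decomposition, the two terms can be computed directly for a toy VAE with a Bernoulli decoder and a diagonal-Gaussian encoder. This is plain NumPy with illustrative shapes and names, not any particular framework’s API:

```python
import numpy as np

def elbo_terms(x, x_recon, mu, log_var):
    """Compute the two ELBO terms for a VAE with a diagonal-Gaussian
    encoder q(z|x) = N(mu, diag(exp(log_var))) and a Bernoulli decoder.

    Returns (reconstruction_log_likelihood, kl_divergence); the ELBO
    is their difference: E_q[log p(x|z)] - D_KL(q(z|x) || p(z)).
    """
    eps = 1e-8  # numerical safety inside the log
    # Reconstruction term: Bernoulli log-likelihood of x under the decoder.
    recon = np.sum(x * np.log(x_recon + eps)
                   + (1 - x) * np.log(1 - x_recon + eps))
    # KL(N(mu, sigma^2) || N(0, I)) has a closed form for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon, kl

# Toy example: one 4-pixel "image", 2-dimensional latent.
x = np.array([1.0, 0.0, 1.0, 1.0])
x_recon = np.array([0.9, 0.1, 0.8, 0.7])
mu = np.array([0.5, -0.3])
log_var = np.array([0.0, 0.0])  # unit variance

recon, kl = elbo_terms(x, x_recon, mu, log_var)
elbo = recon - kl
```

Maximizing this ELBO pushes `recon` up (better reconstructions) while pushing `kl` down (latents closer to the standard-normal prior).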

The gap between ELBO and the true log-likelihood equals the KL divergence between the approximate posterior and the true posterior: log p(x) = ELBO + D_KL(q(z|x) || p(z|x)). A smaller gap means a tighter bound and a better posterior approximation.

This connects to the reparameterization trick. ELBO includes an expectation over the encoder’s distribution, but gradients cannot flow through random sampling. The reparameterization trick reformulates sampling as a deterministic function plus external noise, making ELBO optimizable with standard backpropagation.
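A minimal sketch of that reformulation (NumPy, names illustrative): instead of drawing z directly from N(mu, sigma²), we draw external noise and transform it deterministically, so mu and log_var stay on a differentiable path:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_naive(mu, log_var):
    # Direct sampling: gradients cannot flow back to mu or log_var
    # through the random draw itself.
    return rng.normal(mu, np.exp(0.5 * log_var))

def sample_reparameterized(mu, log_var):
    # Reparameterized: z = mu + sigma * eps with eps ~ N(0, I).
    # The randomness lives in eps; z is a deterministic, differentiable
    # function of mu and log_var, so backpropagation works.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu = np.array([0.5, -0.3])
log_var = np.array([0.0, 0.0])
z = sample_reparameterized(mu, log_var)
```

Both functions produce samples from the same distribution; only the reparameterized version exposes mu and log_var to gradient-based optimization.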

How It’s Used in Practice

Most people encounter ELBO through VAE-based applications. When a VAE generates new images, compresses data, or learns a structured latent space, ELBO is the objective function driving that training. ML engineers monitor ELBO (or its negative, reported as the loss) during training to gauge progress. A rising ELBO signals overall progress, though the combined number alone does not show how reconstruction quality and latent-space organization each contribute.

In practice, teams decompose ELBO into its two components and track them on separate charts. This reveals problems the combined number hides — reconstruction loss might improve while KL divergence drops to near zero, signaling the model is ignoring its latent space.
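A small sketch of that diagnostic split (the function name, input format, and collapse threshold are all illustrative choices, not a standard API):

```python
def summarize_elbo_history(recon_history, kl_history, collapse_threshold=1e-3):
    """Summarize per-epoch ELBO components and flag possible posterior
    collapse: KL near zero means the model may be ignoring its latents.
    Inputs are lists of per-epoch averages (reconstruction log-likelihood
    and KL divergence); the threshold is a rough illustrative default.
    """
    latest_recon = recon_history[-1]
    latest_kl = kl_history[-1]
    return {
        "recon": latest_recon,
        "kl": latest_kl,
        "elbo": latest_recon - latest_kl,       # combined objective
        "possible_collapse": latest_kl < collapse_threshold,
    }

# Reconstruction improves while KL crashes toward zero: the combined
# ELBO looks healthy, but the latent space is collapsing.
stats = summarize_elbo_history(recon_history=[-120.0, -80.0, -60.0],
                               kl_history=[5.0, 1.2, 0.0005])
```

Here the combined ELBO has risen every epoch, yet the `possible_collapse` flag fires, which is exactly the failure the aggregate number hides.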

Beyond standard VAEs, ELBO variants power the VAE stage in latent diffusion models. These models compress images into a latent space using a VAE trained with ELBO, then run diffusion in that compressed space — making ELBO foundational even when diffusion does the visible work.

Pro Tip: If your VAE produces blurry outputs but ELBO keeps improving, the KL term might be dominating. Try KL annealing — reduce the KL weight early in training and gradually increase it. The model learns strong reconstructions first, then tightens latent space regularization.
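One common form of KL annealing is a linear warm-up on the KL weight (often called beta). A minimal sketch, with the schedule shape and step counts as illustrative choices:

```python
def kl_weight(step, warmup_steps=10_000, max_beta=1.0):
    """Linear KL annealing: the KL term's weight ramps from 0 to max_beta
    over warmup_steps, so the model learns strong reconstructions first
    and only then tightens latent-space regularization.
    """
    return max_beta * min(1.0, step / warmup_steps)

# Per-step training loss would then be:
#   loss = recon_loss + kl_weight(step) * kl_loss
```

Other schedules (cyclical, sigmoid-shaped) follow the same idea: weaken the KL penalty early, restore it later.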

When to Use / When Not

Use ELBO when:

  • Training a VAE for image generation or data compression
  • You need a tractable objective for models with latent variables
  • Building a generative model where exact likelihood is intractable

Avoid ELBO when:

  • Your model has a simple likelihood you can compute directly
  • Training a discriminative classifier with labeled data
  • Your posterior approximation is too restrictive for the data complexity

Common Misconception

Myth: Maximizing ELBO is the same as maximizing the likelihood of your data. Reality: ELBO is a lower bound, not the likelihood itself. Maximizing it pushes the model in the right direction, but a gap always remains unless the approximate posterior matches the true posterior exactly. Researchers pursue tighter bounds and more expressive posterior families specifically to shrink this gap.

One Sentence to Remember

ELBO is the computable stand-in for an impossible calculation — it lets VAEs learn by balancing “reconstruct the input accurately” against “keep the latent space organized,” and improving either side makes the whole model better at generating new data.

FAQ

Q: Why can’t we just compute the exact likelihood instead of using ELBO? A: Computing exact likelihood requires integrating over all possible latent representations. For high-dimensional data and complex models, this integral has no closed-form solution and is too expensive to approximate directly.
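To make the cost concrete, here is what naive integration over the latent looks like as a Monte Carlo average, using a toy one-dimensional model chosen so the true evidence is known exactly (the model and sample count are illustrative; real models have thousands of latent dimensions, where this estimator becomes hopeless):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of a 1-D Gaussian N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Toy model with known evidence: p(z) = N(0, 1), p(x|z) = N(z, 1),
# which gives p(x) = N(0, 2) exactly -- our ground truth.
x = 1.0
true_px = gauss_pdf(x, 0.0, 2.0)

# Naive Monte Carlo over the latent: p(x) ~= mean of p(x|z) for z ~ p(z).
# Even this 1-D case needs many samples for a stable estimate.
rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)
est_px = np.mean(gauss_pdf(x, z, 1.0))
```

In one dimension the average converges; in the high-dimensional latent spaces of real VAEs, almost every sampled z contributes essentially zero likelihood, which is why a tractable bound like ELBO is needed instead.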

Q: What happens when the KL divergence term in ELBO drops to zero? A: The encoder produces the same distribution regardless of input — a problem called posterior collapse. The model ignores the latent space entirely and the decoder generates the same average output for everything.

Q: How does ELBO relate to the reparameterization trick in VAEs? A: ELBO requires computing gradients through a sampling operation. The reparameterization trick reformulates sampling as a deterministic function plus external noise, allowing standard backpropagation to optimize ELBO end-to-end.

Sources

  • Wikipedia: Evidence lower bound - Formal definition, derivation via Jensen’s inequality, and decomposition of ELBO
  • Kingma & Welling (2013): Auto-Encoding Variational Bayes - Original VAE paper establishing ELBO as the training objective with reparameterization trick

Expert Takes

ELBO is the consequence of a mathematical constraint, not a design choice. When the true posterior is intractable, variational inference substitutes an approximation and bounds the error. The tightness of that bound depends entirely on how expressive the approximate posterior family is. More flexible approximations — normalizing flows, for instance — narrow the gap between bound and reality, but never eliminate it without recovering the true posterior.

In any VAE training pipeline, ELBO is your primary diagnostic signal. Watch both components separately: if reconstruction loss plateaus while KL divergence keeps dropping toward zero, your latent space is collapsing. If KL stays persistently high without corresponding reconstruction gains, the encoder may be capturing noise rather than useful structure. Logging these components independently during training gives you the information to adjust hyperparameters before wasting compute on a run that will not converge.

Every latent diffusion model generating images today starts with a VAE stage optimized by ELBO. That makes ELBO the training infrastructure behind billions of generated images, even though most users never see it. Teams building these systems care about ELBO tightness because a better-organized latent space means the diffusion model trains faster and produces sharper results. Math from over a decade ago still powers what ships today.

ELBO forces an explicit trade-off between faithfulness and simplicity. The reconstruction term demands accuracy; the KL term demands conformity to a prior. When that prior is a standard normal distribution — the default choice — we impose a strong structural assumption about what “organized” means. That assumption shapes what the model can represent. Choosing a different prior changes what the model learns, meaning seemingly neutral math encodes real decisions about what counts as meaningful variation.