MONA explainer 10 min read April 9, 2026

From ReLU to SwiGLU: How Activation and Loss Functions Shape LLM Training

MONA tracing signal flow through neural network layers from ReLU to SwiGLU activation functions

Table of Contents

ELI5

Activation and loss functions control what a model learns during training. Activations gate which signals pass between layers. Loss measures prediction error and sends corrections backward. Together, they determine every weight update the model makes.

Same architecture. Same data. Same compute budget. Change one function — the Activation Function in the feed-forward block — and training quality shifts measurably. ReLU’s formula fits on a napkin. SwiGLU’s barely fills a paragraph. Yet the dominant open-weight models released after 2023 — LLaMA, Mistral, Gemma, DeepSeek — all chose SwiGLU, and the paper that introduced it offered no theoretical explanation for why it outperforms the alternatives (Shazeer, 2020). That gap between empirical result and missing theory is exactly the kind of anomaly worth tracing to its source.

The Gate That Decides What Gets Through

Every layer in a Neural Network Basics for LLMs receives a vector of numbers — activations from the previous layer — and must decide what to pass forward. Without an activation function, the entire network collapses into a single linear transformation, regardless of how many layers you stack. Multiplying matrices by matrices still yields a matrix. The activation function is what breaks that linearity and gives depth its purpose.

What activation functions are used in large language models

ReLU — Rectified Linear Unit — dominated deep learning for roughly a decade. Its formula is almost insulting in its simplicity: output the input if it is positive, output zero otherwise. No exponents, no probability distributions, no learned parameters. It trains fast because the gradient is either 1 or 0, and it solved the Gradient Descent bottleneck that had stalled deeper networks through the 2000s.

But ReLU carries a structural defect that becomes expensive at scale.

When an input falls below zero, the gradient is exactly zero — permanently. The neuron stops updating. In a small network, a few dead neurons are a rounding error. In a model with billions of parameters, dead neurons represent wasted capacity that never recovers. This is the “dying ReLU” problem (Ultralytics Glossary), and it is the pressure that drove the search for smoother alternatives.

GELU — Gaussian Error Linear Unit — appeared in 2016. Instead of a hard cutoff at zero, GELU multiplies the input by the probability that a standard Gaussian random variable would fall below that value: GELU(x) = x * Φ(x), where Φ is the standard Gaussian CDF (Hendrycks & Gimpel, 2016). The effect is that small negative inputs get attenuated rather than killed outright. The neuron dims instead of dying. GPT-2, GPT-3, and BERT all used GELU in their feed-forward layers — it became the default activation for the 2018-2022 generation of transformers.

Then came SwiGLU, and the architecture of the feed-forward block itself changed.

Noam Shazeer proposed SwiGLU in “GLU Variants Improve Transformer” (Shazeer, 2020). The difference is not just the nonlinearity — it is structural. Instead of one weight matrix transforming the input before an activation, SwiGLU uses a gated mechanism with two parallel linear projections: SwiGLU(x, W, V) = Swish(xW) ⊗ (xV), where ⊗ is element-wise multiplication and Swish is what PyTorch implements as SiLU. Three weight matrices per feed-forward layer instead of two. Roughly 50% more parameters per layer — but consistently better training quality on downstream tasks.

SwiGLU and GEGLU produced the best perplexities compared to ReLU and GELU baselines in experiments on T5 (Shazeer, 2020). LLaMA 2 and 3, Mistral, Mixtral, Gemma, PaLM, and DeepSeek all adopted SwiGLU (NVIDIA Docs). In the post-2023 wave of open-weight LLMs, SwiGLU became the default.

The theoretical justification for SwiGLU’s advantage remains incomplete. The empirical gain is reproducible across architectures and scales. The mechanism behind it is not fully characterized — which is either humbling or reassuring, depending on how much certainty you require before trusting your infrastructure.

One Number That Measures How Wrong You Are

A Loss Function takes the model’s prediction and the correct answer and returns a single number — the distance between them. That number is the error signal. Everything the model learns during training, it learns by making that number smaller through Backpropagation.

What is cross-entropy loss and how does it train language models

For next-token prediction — the training objective of autoregressive LLMs — the standard loss is Cross Entropy Loss. The formula is deceptively compact: L = -log(p_y), where p_y is the probability the model assigned to the correct next token (PyTorch Docs).

Think of the model’s output as a probability distribution stretched across the entire vocabulary — tens of thousands of possible tokens. Cross-entropy measures how far this predicted distribution sits from the ideal one, where the correct token has probability 1 and everything else has probability 0. The asymmetry is sharp — confident wrong answers cost exponentially more; assigning probability 0.01 to the correct token produces a much larger gradient than assigning 0.3.

Not a quirk. The mechanism.

That asymmetry means the model learns fastest precisely when it is most wrong. As predictions improve, the gradient diminishes and the model shifts from overhauling its representations to fine-tuning them. This is why training loss curves are steep early and shallow late — the loss function itself throttles the learning rate implicitly, independent of any optimizer schedule.

But there is a catch. The gradient that cross-entropy generates must travel backward through every layer of the network — dozens of them, sometimes over a hundred. And that journey has a failure mode that nearly killed deep learning before it began.

When the Gradient Disappears

Deep networks learn by passing error gradients backward from the output layer through every intermediate layer to the first. At each layer, the gradient is multiplied by the local derivative of that layer’s activation function. If those local derivatives are consistently less than 1, the gradient shrinks exponentially with depth. If they are consistently greater than 1, the gradient explodes.

Why do deep neural networks suffer from vanishing and exploding gradients

Sepp Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis at TU Munich (Wikipedia). The arithmetic is blunt: a gradient passing through 100 layers, each multiplying by 0.9, arrives at the first layer scaled by 0.9 to the 100th power — roughly 0.00003. The early layers receive a learning signal so attenuated they barely update their weights. The network has depth on paper but learns as if its first layers were frozen in place.

Exploding gradients are the mirror failure. Local derivatives above 1, compounded over depth, produce gradients so large that weight updates overshoot and destabilize training. The model does not learn slowly — it destroys what it already learned.

I’ve watched training runs collapse in exactly this way: loss drops steadily for hours, then a single batch produces a gradient spike that sends the weights into a region from which the model never recovers. The loss curve doesn’t plateau. It detonates.

Three innovations, arriving within a five-year window, made modern depth possible: ReLU in 2010, batch normalization in 2015, and residual connections in 2015 (Wikipedia). ReLU addressed vanishing gradients because its derivative is exactly 1 for positive inputs — no shrinkage. Batch normalization stabilized the distribution of intermediate activations so layers stopped chasing moving targets. Residual connections created a gradient highway — a skip path that lets the error signal flow directly across layers without being multiplied down at each step.

Modern LLMs combine all three strategies, though the specific activation function has since evolved from ReLU through GELU to SwiGLU. The vanishing gradient problem is managed, not eliminated — and that distinction shapes every architectural decision, from layer count to learning rate scheduling to the choice of optimizer.

Evolution of activation functions from ReLU through GELU to SwiGLU showing gradient flow behavior at each stage — How activation functions evolved to keep gradients alive in deep networks.

What the Functions Predict About Your Training Run

The choice of activation function is not cosmetic. Switching from GELU to SwiGLU in a feed-forward block adds roughly 50% more parameters per layer — a concrete cost in memory and compute. The tradeoff is better gradient flow through the gated structure and, empirically, lower perplexity. For smaller models where parameter budget is tight, that overhead may not justify itself. For large-scale training runs where quality per compute dollar matters, the post-2023 consensus is clear.

Cross-entropy loss creates a specific training dynamic: the model learns fastest from its worst predictions and decelerates as it improves. If you see training loss plateau early, the gradient signal has likely weakened below the threshold where the optimizer can make progress — a sign to examine learning rate warmup, data mixing, or whether model capacity matches task complexity.

If training loss drops steeply but validation loss stalls, the loss function is doing its job — the model is overfitting, not underlearning. The fix lives in the data or regularization strategy, not in the loss function.

Rule of thumb: activation function choice determines the quality of forward signal propagation; loss function choice determines the quality of backward error propagation. Neither decision is independent of the other — they co-determine the gradient geometry of the optimization.

When it breaks: SwiGLU’s extra parameter cost becomes a liability when memory is the binding constraint. On edge devices or with very long context windows, the 50% parameter overhead per feed-forward layer can force tradeoffs between model depth and sequence length. The activation function that helps at scale can hurt at the margins.

The Data Says

Activation functions and loss functions are not accessories bolted onto the architecture. They are the mechanisms that determine whether depth produces learning or produces noise. The evolution from ReLU to SwiGLU traces a single engineering arc: keep the gradient alive, let the gate decide what passes through, and accept the parameter cost. Cross-entropy loss converts prediction error into a learning signal whose strength is proportional to the model’s confidence in the wrong answer. Every architectural choice in a modern LLM — layer count, normalization, residual connections — exists to protect that signal on its journey backward through depth.

Aha Moments

MAX

Mona just gave you a diagnostic guide wearing the skin of a history lesson. The activation function is a spec decision — one that changes your memory budget, your gradient stability profile, and your layer count ceiling in a single line of config. What I want engineers to take away: SwiGLU is not a drop-in replacement. It adds an extra weight matrix per feed-forward block. That means your parameter count changes, your optimizer state grows, and your memory profiling has to account for it before you commit. The right approach is to write down the tradeoff explicitly — memory cost, expected quality gain, breakeven point for your training budget. If you cannot fill in those figures, you are not making an architectural decision. You are making a guess.

DAN

Max wants a filled-in tradeoff table, and he is not wrong — but the bigger signal is strategic. The industry converged on SwiGLU not because theory demanded it, but because the empirical results were too consistent to ignore. Mona pointed out that the theoretical explanation is still missing. That pattern matters. When every major team independently reaches the same architectural choice without a shared theoretical framework, competitive pressure to extract quality per compute dollar is driving decisions faster than the science can explain them. For anyone building or buying foundation models, the practical question is not “why does SwiGLU work” but “what is the next function shift, and will your infrastructure be ready when it arrives?”

ALAN

Both of you are treating this as an optimization story — Max from the engineering side, Dan from the strategy side. Neither of you paused on the detail I find most uncomfortable: we are building systems whose internal mechanisms we admit we cannot fully explain, then trusting those systems with increasingly consequential work. Mona noted that SwiGLU’s advantage is “reproducible but not fully characterized.” Residual connections are a workaround for a problem we manage but have not solved — the gradient still vanishes, we just built highways around the bottleneck. The engineering pragmatism is understandable. The question it leaves unanswered is not: at what point does the gap between “it works” and “we understand why” become a risk we cannot engineer our way around?

AI-assisted content, human-reviewed. Images AI-generated. Editorial Standards · Our Editors