Feedforward Network

Also known as: FFN, Feed-Forward Neural Network, Position-wise FFN

Feedforward Network
A neural network where data moves in one direction from input to output with no loops or cycles, used as a core processing sub-layer inside each Transformer block to transform learned representations.

What It Is

When you build or fine-tune a Transformer model using tools like Hugging Face or PyTorch, every Transformer block in the model contains two key components: a multi-head attention mechanism and a feedforward network. The attention mechanism figures out which parts of the input matter most for each token. The feedforward network then takes those attention-refined representations and transforms them further, one position at a time.

Think of a feedforward network like a one-way assembly line. Raw material enters one end, gets shaped by a series of steps, and exits the other end as a finished product. This one-directional flow is what separates feedforward networks from recurrent networks, which loop outputs back as inputs and process tokens one after another.

Inside a Transformer, the feedforward sub-layer works on each token position independently. According to Vaswani et al. (“Attention Is All You Need,” 2017), it applies two linear transformations with an activation function in between. The data first expands from the model’s base dimension into a larger intermediate space, typically four times wider, then compresses back to the original size. This expand-and-compress pattern lets the network learn complex patterns without permanently inflating the data dimensions.

The formula: FFN(x) = W2 * activation(W1 * x + b1) + b2. The first weight matrix (W1) projects the input into a higher-dimensional space. The activation function, usually ReLU or GELU, introduces non-linearity so the network can capture relationships beyond simple linear mappings. The second weight matrix (W2) projects everything back down. Because this happens at every position independently, the operation is called “position-wise,” and parallelizes efficiently on GPUs.
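
The expand-activate-compress formula above can be sketched in a few lines of PyTorch. The class and dimension names here (FeedForward, d_model, d_ff) are illustrative, not taken from any particular library’s source:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to d_ff, apply a non-linearity, compress back.
    d_ff = 4 * d_model follows the common four-times expansion ratio."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # W1, b1: expand
        self.act = nn.GELU()                 # non-linearity (ReLU also common)
        self.w2 = nn.Linear(d_ff, d_model)   # W2, b2: compress

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights apply at every position
        return self.w2(self.act(self.w1(x)))

ffn = FeedForward(d_model=64, d_ff=256)
x = torch.randn(2, 10, 64)
print(ffn(x).shape)  # torch.Size([2, 10, 64]) — output matches input shape
```

Because the module only ever sees the last dimension, it runs on the whole batch and sequence in one matrix multiplication, which is why the operation parallelizes so well on GPUs.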

Some newer architectures use gated activation variants like SwiGLU or GeGLU for better training stability, but the core expand-transform-compress structure remains unchanged across Transformer-based models.
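
As a rough sketch of such a gated variant, a SwiGLU-style FFN replaces the single expansion with a gated pair of projections. The names below are illustrative, and details (bias terms, exact widths) vary between models:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN variant: silu(W1 x) * (V x), projected back down by W2.
    A sketch of the SwiGLU pattern, not any specific model's implementation."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.v = nn.Linear(d_model, d_ff, bias=False)   # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Elementwise gate: silu acts as the "Swish" in SwiGLU
        return self.w2(F.silu(self.w1(x)) * self.v(x))

out = SwiGLUFFN(64, 256)(torch.randn(2, 5, 64))
print(out.shape)  # torch.Size([2, 5, 64])
```

Note the overall shape is unchanged: expand, apply a non-linearity, compress back to d_model.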

How It’s Used in Practice

If you are building or fine-tuning Transformer models with Hugging Face and PyTorch, you interact with feedforward networks every time you configure a model’s architecture. The hidden dimension of the FFN layer, often labeled intermediate_size or d_ff in configuration files, is one of the main levers for controlling model capacity. Increasing it gives the model more room to learn, but also increases memory usage and training time. In PyTorch, these layers are built with torch.nn.Linear, which handles the matrix multiplication and bias behind each transformation.

When fine-tuning a pretrained model, the feedforward layers are where much of the adaptation happens. Techniques like LoRA (a parameter-efficient fine-tuning method) often target attention weights, but the FFN weights store a large portion of the model’s learned knowledge. Understanding which layers to freeze and which to update during fine-tuning starts with knowing what each component does.

Pro Tip: When you hit out-of-memory errors during fine-tuning, reducing the feedforward intermediate dimension is often more effective than cutting attention heads. The FFN layers typically account for about two-thirds of a Transformer block’s parameters, so shrinking them frees up significant GPU memory without fundamentally changing how the model processes context.
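
The two-thirds figure follows from a back-of-the-envelope count. Ignoring biases, layer norms, and embeddings (an approximation), a block with model dimension d has 4·d² attention parameters (Q, K, V, and output projections) and 2·d·4d = 8·d² FFN parameters:

```python
# Rough per-block parameter count for a Transformer with model dim d
# and the usual 4x FFN expansion. Biases and layer norms are ignored.
d = 768                        # e.g. a BERT-base-sized model dimension
attn_params = 4 * d * d        # Q, K, V, and output projections
ffn_params = 2 * d * (4 * d)   # expand (d -> 4d) plus compress (4d -> d)
total = attn_params + ffn_params
print(f"FFN share of block parameters: {ffn_params / total:.0%}")  # 67%
```

8·d² out of 12·d² is exactly two-thirds, which is why shrinking the FFN width is such an effective memory lever.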

When to Use / When Not

Scenario | Use or Avoid
Building a standard Transformer model for text tasks | Use
Processing sequences where each position needs independent transformation | Use
Tasks requiring the model to remember outputs from previous time steps | Avoid
Designing lightweight models with strict memory constraints | Avoid
Adding a processing layer after attention in a custom architecture | Use
Problems where sequential, time-dependent state is critical | Avoid

Common Misconception

Myth: Feedforward networks in Transformers process the entire sequence at once, blending information across all tokens. Reality: The feedforward sub-layer in a Transformer processes each token position completely independently. Cross-token interaction happens only in the attention layer. The FFN transforms each position’s representation in isolation, which is why it is called “position-wise.”
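
This independence is easy to verify with a toy FFN in PyTorch (dimensions here are arbitrary): running the whole sequence through at once gives the same result as feeding each position through separately.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, d_ff = 8, 32
ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

x = torch.randn(1, 5, d_model)   # one sequence, 5 token positions
full = ffn(x)                    # whole sequence in one call

# Feed each position through on its own, then stack the results back up
one_at_a_time = torch.stack([ffn(x[:, i]) for i in range(5)], dim=1)

# Identical outputs: no information crosses between positions in the FFN
print(torch.allclose(full, one_at_a_time))  # True
```

Any blending across tokens happens in the attention layer before the FFN ever sees the data.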

One Sentence to Remember

The feedforward network is the Transformer’s workhorse processor, sitting after the attention layer in every block, expanding each token’s representation into a richer space and then compressing it back, giving the model the capacity to learn patterns that attention alone cannot capture.

FAQ

Q: What is the difference between a feedforward network and a recurrent neural network? A: A feedforward network passes data in one direction with no loops. A recurrent network feeds outputs back as inputs, allowing it to maintain state across sequential steps but making it harder to parallelize.

Q: Why does the feedforward layer expand and then compress the dimensions? A: The expansion into a larger intermediate space lets the network learn more complex transformations. Compressing back keeps the output compatible with the rest of the Transformer block and prevents parameter counts from growing uncontrollably.

Q: Can you modify the feedforward layer when fine-tuning a pretrained model? A: Yes. FFN layers hold a large share of model parameters and store much of the learned knowledge. Fine-tuning approaches often target these weights, and adjusting the intermediate dimension can help balance quality against compute resources.

Expert Takes

The feedforward sub-layer is mathematically a two-layer perceptron applied identically at every sequence position. Its role is to introduce non-linear capacity that the attention sub-layer’s linear projections cannot provide on their own. The intermediate expansion ratio, typically a factor of four relative to the model dimension, balances expressiveness against computational cost. Without the activation in between, the FFN’s two linear layers would collapse into a single linear transformation, severely limiting what the model can represent.
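
The collapse is a one-line matrix identity: without an activation, W2(W1·x) equals (W2·W1)·x, a single linear map. A quick numerical check (toy dimensions, no biases):

```python
import torch

torch.manual_seed(0)
d, d_ff = 4, 16
W1 = torch.randn(d_ff, d)   # expansion weights
W2 = torch.randn(d, d_ff)   # compression weights
x = torch.randn(3, d)       # a small batch of inputs

# Two stacked linear maps with no activation...
two_layers = x @ W1.T @ W2.T
# ...equal one precomputed linear map:
collapsed = x @ (W2 @ W1).T
print(torch.allclose(two_layers, collapsed, atol=1e-5))  # True
```

Inserting a ReLU or GELU between the two matrices breaks this identity, which is exactly what gives the FFN its extra representational power.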

When configuring Transformer models in code, the feedforward dimensions are the first architectural parameter worth adjusting. The intermediate_size field in a Hugging Face model config directly controls FFN width. If you are building custom layers in PyTorch, each FFN is two Linear calls with an activation between them. Getting this dimension right matters more than most hyperparameters because FFN layers consume the majority of parameters in each Transformer block.

Feedforward layers are where most of a Transformer’s parameter budget goes, and that has direct implications for deployment costs. Organizations fine-tuning models need to understand this tradeoff: wider FFN layers improve the model’s ability to encode knowledge, but they also drive up inference costs per token. Teams running models at scale are increasingly experimenting with sparse FFN variants like Mixture of Experts, where only a fraction of the feedforward capacity activates per input, cutting compute while preserving quality.

The feedforward network’s simplicity masks a deeper question about how these models store and retrieve knowledge. Research suggests that FFN layers function as key-value memories, where the first projection acts as a key lookup and the second as value retrieval. If that framing holds, then fine-tuning FFN weights is not just adjusting parameters. It is rewriting what the model “knows.” Anyone deploying fine-tuned models should consider what knowledge they are overwriting and whether the original information still matters.