Encoder-Decoder
Also known as: encoder-decoder architecture, enc-dec, sequence-to-sequence model
- A neural network design where an encoder compresses input into a dense representation and a decoder generates output from that representation, forming the original transformer blueprint for tasks like translation and summarization.
An encoder-decoder is a two-part neural network architecture where the encoder reads and compresses input into a dense representation, and the decoder uses that representation to produce an output sequence step by step.
What It Is
If you’re studying how transformers work, the encoder-decoder design is where the story starts. Before GPT-style chatbots and Claude existed, the original transformer from 2017 was built as an encoder-decoder system to solve machine translation. Understanding this architecture gives you the foundation to see why later designs dropped one half or the other, and what they gained (and lost) in doing so.
Think of the encoder-decoder like a relay race with two runners. The encoder is the first runner: it reads the entire input (say, a French sentence) and compresses it into a baton, a dense numerical summary that captures the meaning. (In a transformer, the baton is really a sequence of contextualized vectors, one per input token, rather than a single summary vector.) The decoder is the second runner: it takes that baton and produces the output (the English translation) one word at a time, always looking back at what it already generated plus the information packed into the baton.
According to Vaswani et al., the original transformer stacked 6 encoder layers and 6 decoder layers. Each encoder layer has two key components: a self-attention mechanism that lets every word in the input look at every other word (bidirectional), and a feed-forward network that processes each position independently. According to D2L, encoder self-attention is bidirectional, meaning the word “bank” in a sentence can look both left and right to figure out whether it means a river bank or a financial institution.
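The encoder side can be sketched in a few lines of NumPy. This is a toy illustration, not the paper's implementation: the dimensions are made up, the projection matrices are identities for simplicity, and multi-head splitting, residual connections, and the feed-forward network are omitted. The point is that the score matrix has no mask, so every position attends to every other position, left and right.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Bidirectional (encoder-style) scaled dot-product self-attention.
    # No mask: each row of `scores` covers ALL positions in the sequence.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # shape (seq_len, seq_len)
    weights = softmax(scores)         # each row sums to 1
    return weights @ V

# Toy example: 4 input tokens, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = Wk = Wv = np.eye(8)  # identity projections keep the sketch readable
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per input token
```

Because nothing is masked, the vector for "bank" in the output is a weighted mix of every token in the sentence, which is exactly the bidirectional behavior described above.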
The decoder layers mirror this structure but add a twist. According to D2L, decoder self-attention is causal (masked), meaning each word can only attend to previous positions, not future ones. This prevents the decoder from “cheating” by peeking at words it hasn’t generated yet. On top of that, each decoder layer includes cross-attention, where the decoder queries the encoder’s output. According to Vaswani et al., cross-attention works by using queries from the decoder and keys/values from the encoder, allowing the decoder to focus on the most relevant parts of the input at each generation step.
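The two decoder-specific attention patterns can be sketched the same way. Again this is a hedged toy illustration in NumPy: projection matrices are omitted entirely, and the sequence lengths and names (`enc_out`, `dec_in`) are assumptions for the example. The causal mask sets future positions to negative infinity before the softmax, so their attention weights become exactly zero; cross-attention takes its queries from the decoder states and its keys/values from the encoder output.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    # Causal (decoder-style) self-attention: position i may only attend
    # to positions <= i, enforced by a -inf mask above the diagonal.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True = future position
    scores = np.where(mask, -np.inf, scores)               # exp(-inf) -> weight 0
    return softmax(scores) @ X

def cross_attention(decoder_states, encoder_states):
    # Cross-attention: queries from the decoder, keys/values from the
    # encoder output (linear projections omitted for brevity).
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)  # (tgt_len, src_len)
    return softmax(scores) @ encoder_states

rng = np.random.default_rng(1)
enc_out = rng.normal(size=(5, 8))  # 5 source tokens from the encoder
dec_in = rng.normal(size=(3, 8))   # 3 target tokens generated so far
print(masked_self_attention(dec_in).shape)     # (3, 8)
print(cross_attention(dec_in, enc_out).shape)  # (3, 8)
```

Note that the cross-attention score matrix is rectangular, (target length x source length): at every generation step the decoder re-weights the entire encoder output, which is how it "looks back at the baton."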
This three-part structure (encoder self-attention, decoder masked self-attention, and cross-attention) is what distinguishes the full encoder-decoder from the simpler decoder-only models that power most modern chatbots.
How It’s Used in Practice
Most people encounter encoder-decoder models through translation services and summarization tools. When you paste a paragraph into a summarizer and get a shorter version back, there’s a good chance a model like T5 or BART is running under the hood. These models still use the full encoder-decoder design because the task genuinely has two distinct phases: understand the full input, then produce a different-length output.
In the context of understanding transformers and their prerequisites, the encoder-decoder is the reference architecture everything else branches from. Decoder-only models (GPT, Claude, Llama) dropped the encoder entirely and use only masked self-attention. Encoder-only models (BERT) dropped the decoder and focus on understanding input without generating new text. Knowing the original two-part design helps you see what each variant chose to keep and what it sacrificed.
Pro Tip: When you’re reading transformer papers or tutorials that mention “self-attention,” check whether they mean bidirectional (encoder-style) or causal/masked (decoder-style). Confusing the two is one of the most common sources of misunderstanding when you’re building mental models of how these architectures differ.
When to Use / When Not
| Scenario | Encoder-decoder? |
|---|---|
| Translating text from one language to another | ✅ Use |
| Generating conversational responses in a chatbot | ❌ Avoid |
| Summarizing long documents into shorter versions | ✅ Use |
| Classifying text into categories (sentiment, topic) | ❌ Avoid |
| Converting structured data into natural language descriptions | ✅ Use |
| Building a text embedding model for search | ❌ Avoid |
Common Misconception
Myth: Modern LLMs like GPT and Claude use encoder-decoder architecture because they’re transformers. Reality: Most modern LLMs are decoder-only. They dropped the encoder entirely and rely on causal self-attention alone. The “transformer” label describes a family of architectures, not one fixed design. The full encoder-decoder is now mainly used in tasks where the input and output are structurally different, like translation or summarization.
One Sentence to Remember
The encoder-decoder is the original two-part transformer design that reads input completely before generating output, and understanding it is the key to seeing why modern LLMs chose to keep only the decoder half.
FAQ
Q: What is the difference between encoder-decoder and decoder-only models? A: Encoder-decoder models process the full input first, then generate output. Decoder-only models do both in a single pass using only causal self-attention, which simplifies training for open-ended generation.
Q: Why did most LLMs move away from encoder-decoder architecture? A: Decoder-only designs are simpler to scale and train on massive text corpora. They handle open-ended generation well without needing a separate encoder, which makes pre-training more straightforward.
Q: Is BERT an encoder-decoder model? A: No. BERT uses only the encoder half with bidirectional self-attention. It excels at understanding tasks like classification and search but cannot generate text autoregressively like a decoder.
Sources
- Vaswani et al.: Attention Is All You Need - the original 2017 paper introducing the transformer with its encoder-decoder design
- D2L: The Transformer Architecture (Dive into Deep Learning) - detailed walkthrough of encoder and decoder components with code examples
Expert Takes
The encoder-decoder split reflects a fundamental distinction in how neural networks handle information: encoding is about building a complete internal representation of the input, while decoding is about generating output conditioned on that representation. Bidirectional attention in the encoder and causal masking in the decoder are not arbitrary choices. They match the information flow requirements of each task phase precisely.
If you’re working with an AI tool that does translation or summarization, the encoder-decoder design is probably running behind the scenes. The practical difference from decoder-only matters when you’re choosing a model for a specific task. Document-in, summary-out workflows benefit from the two-stage process. Free-form chat does not, which is why your coding assistant uses a decoder-only model instead.
The encoder-decoder architecture shaped a generation of AI products, from Google Translate improvements to enterprise document summarization. The shift toward decoder-only models tells a business story too: companies found that one architecture trained on enough data could handle most tasks acceptably, reducing the need for task-specific model designs and lowering operational complexity.
The original encoder-decoder design forced a clean separation between understanding and generating. When decoder-only models merged these phases, they gained flexibility but lost that explicit boundary. Worth asking: does collapsing the distinction between comprehension and generation make models harder to audit, harder to interpret, and ultimately harder to trust when the stakes are high?