
Transformer Internals for Developers: What Maps, What Breaks
Transformer internals mapped for backend developers. Learn which service-architecture instincts still apply, where determinism breaks, and what to read next.
Transformer internals are the mechanisms that make modern language models work — attention, positional encoding, and encoder-decoder designs that replaced recurrent networks in 2017.
Each topic below is a key concept in this domain. Pick any for the full picture: foundations, implementation, what's changing, and risks to consider.
An attention mechanism is a neural network component that lets a model dynamically focus on the most relevant parts of its input when producing each part of its output.
Decoder-only architecture is a transformer design where a single decoder stack generates output tokens one at a time, each conditioned on every token that came before it.
Encoder-decoder architecture is a neural network design pattern where an encoder network compresses an input sequence into a representation that a separate decoder network turns into an output sequence.
Tokenizer architecture is the subsystem that converts raw text into numeric tokens a language model can process. It determines the model's vocabulary and how text gets split into subword units.
The transformer architecture is a neural network design that uses self-attention to process all parts of an input sequence in parallel rather than step by step.
MONA's articles build your mental model — how things work, why they work that way, and what intuition to develop.
Updated Mar 23, 2026
Concepts covered

Attention mechanisms let neural networks weigh input relevance dynamically. Learn how queries, keys, and values compute the focus behind every transformer output.
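For a concrete feel of that query/key/value step, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and names are illustrative, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weigh each value by how well its key matches each query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # weighted average of values

# Toy example: 4 tokens, one 8-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)         # shape (4, 8)
```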

The encoder-decoder bottleneck crushed long sequences into one vector. Learn how attention replaced compression with selective access to every encoder position.

BPE tokenizers produce glitch tokens and penalize non-Latin scripts with fertility gaps. Learn where the math breaks — and what is emerging to fix it.

Multi-head attention, positional encoding, and encoder-decoder structure: the three mechanisms inside every transformer, explained from geometry to implementation.
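As a taste of the positional-encoding piece, a small sketch of the sinusoidal scheme from the original transformer paper, with example dimensions chosen arbitrarily:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] uses cos."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model // 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=64)  # added to token embeddings
```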

Understand why RNNs failed, how transformer self-attention trades parallelism for quadratic cost, and what these trade-offs predict for long-context language models.

Self-attention, cross-attention, and causal masking solve different problems inside transformers. Learn the math, trade-offs, and the quadratic scaling wall.
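A quick sketch of the causal-masking part: an upper-triangular mask blocks attention to future positions, so each token only sees its past (NumPy, toy sizes):

```python
import numpy as np

def causal_mask(seq_len):
    """True above the diagonal: positions a token is not allowed to see."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Blocked positions get -inf, so their attention weight becomes exactly zero."""
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(5, 5))   # toy attention scores
weights = masked_softmax(scores, causal_mask(5))
# Row i now puts nonzero weight only on columns 0..i.
```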

Decoder-only architecture powers every major LLM today. Learn how causal masking, KV cache, and autoregressive generation produce text one token at a time.
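And a toy sketch of why the KV cache matters during generation: each new token computes one fresh query but attends over cached keys and values, so earlier positions are never reprocessed (NumPy, single head, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
k_cache, v_cache = [], []                 # grow by one entry per generated token

def attend_next(x):
    """Process one new token: cache its key/value, attend over everything so far."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    k_cache.append(k)
    v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)    # (t, d): keys/values seen so far
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

# Feed a toy sequence one token at a time; only the newest token does fresh work.
for x in rng.normal(size=(5, d)):
    out = attend_next(x)
```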

Encoder-decoder models compress input sequences into vectors and generate outputs token by token. Learn how seq2seq works and why attention changed everything.

Tokenizer architecture determines how LLMs read text. Learn how BPE, WordPiece, and Unigram split text into subword tokens before attention ever fires.
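To make the BPE idea concrete, here is a minimal sketch of its training loop: count adjacent symbol pairs across a toy corpus and merge the most frequent pair. This is just the core step, with no real vocabulary or pre-tokenization:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1]); i += 2
            else:
                out.append(symbols[i]); i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word pre-split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(4):                                  # learn 4 merges
    words = merge_pair(words, most_frequent_pair(words))
```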

Transformers replaced sequential recurrence with parallel self-attention. Understand QKV computation, multi-head attention, and the quadratic scaling trade-off.

Decoder-only models won the scaling race by doing less. Learn how a simpler training objective, scaling laws, and MoE extensions beat encoder-decoder design.

Transformers use weighted averaging, not human-like focus: scaled dot-product attention, self-attention versus cross-attention, and why the scaling factor matters.

Master the math behind attention mechanisms — dot products, softmax, QKV matrices, and multi-head projections — before tackling transformer architecture.
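For reference, the two formulas those pieces assemble into, in the notation of the original transformer paper:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\qquad
\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}),
\quad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```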

Master the math behind transformers: embeddings, matrix multiplication, positional encoding, and multi-head attention explained with the precision engineers actually need.

The transformer architecture powers every major LLM. Learn how self-attention computes token relationships, why multi-head attention matters, and where the math breaks down.

Standard attention scales quadratically with sequence length. Learn why O(n²) breaks down at long contexts, what capacity attention sinks waste, and where the fixes stand.

Transformer self-attention scales quadratically with sequence length. Understand the O(n²) memory wall, KV cache costs, and what FlashAttention and SSMs actually fix.
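To see where that wall sits in practice, here is a back-of-the-envelope sketch; the layer, head, and precision numbers are assumptions chosen to resemble a 7B-class model, not measurements of any specific one:

```python
def attention_matrix_bytes(seq_len, n_heads, bytes_per_value=2):
    """One layer's full attention-score matrix: n_heads x seq_len x seq_len values."""
    return n_heads * seq_len * seq_len * bytes_per_value

def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim, bytes_per_value=2):
    """Keys + values cached for every layer, head, and position (linear in seq_len)."""
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_value

# Assumed 7B-class shape: 32 layers, 32 heads, head_dim 128, fp16 values.
for n in (4_096, 32_768, 131_072):
    scores = attention_matrix_bytes(n, n_heads=32)
    cache = kv_cache_bytes(n, n_layers=32, n_heads=32, head_dim=128)
    print(f"{n:>7} tokens: naive scores/layer {scores / 2**30:7.1f} GiB, KV cache {cache / 2**30:6.1f} GiB")
```

The quadratic term is the per-layer score matrix that naive attention would materialize (what FlashAttention avoids writing out); the KV cache grows only linearly but still dominates memory at long contexts.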