Multi-Head Attention

Also known as: MHA, multi-headed attention, parallel attention heads

Multi-Head Attention
A mechanism inside transformers that splits attention into multiple parallel heads, each learning different relationships in the input, then combines their outputs for richer representations.

Multi-head attention is the mechanism that lets transformer models focus on multiple parts of an input simultaneously, enabling richer understanding of language, code, and structured data.

What It Is

Every time you ask an AI assistant to summarize a document or write code, the model needs to figure out which words in your input relate to which other words. A single attention function can track one type of relationship at a time — say, which noun a pronoun refers to. But language is layered. The word “bank” in “she sat on the river bank” needs grammatical context, positional context, and semantic context all at once. Multi-head attention solves this by running several attention functions in parallel, each one free to specialize in a different type of relationship.

Here is how it works mechanically. The model takes each input token and creates three vectors from it: a query (what am I looking for?), a key (what do I contain?), and a value (what information do I carry?). In single-head attention, there is one set of learned weights that produces these vectors. Multi-head attention splits this into h independent sets of weights — h “heads.” Each head projects queries, keys, and values into a smaller subspace, computes attention scores independently, and produces its own output. According to Vaswani et al., the original transformer used 8 heads with a model dimension of 512, giving each head a 64-dimensional subspace.
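The per-head computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the query, key, and value matrices here are random placeholders standing in for the outputs of learned projections, and the head width of 64 mirrors the per-head dimension from the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head.

    Q, K, V: (seq_len, d_k) arrays. Scores are divided by sqrt(d_k)
    so their magnitude stays stable as the head dimension grows.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (seq_len, d_k)

# Toy sequence: 4 tokens, one 64-dimensional head.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 64)
```

Each of the h heads runs exactly this computation on its own 64-dimensional slice, with its own learned projection weights.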

Think of it like a team of analysts reading the same report. One analyst tracks financial figures, another watches for legal risks, a third maps organizational relationships. Each reads the full document but focuses on a different angle. After they finish, their notes are combined into a single briefing. That concatenation step is exactly what happens in multi-head attention: all head outputs are concatenated and passed through a final linear projection to produce one unified representation.
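The full pipeline, from per-head projection through the final concatenate-and-project step, can be sketched end to end in NumPy. This is a hedged sketch under simplifying assumptions: the projection matrices are random stand-ins for learned weights, and batching and masking are omitted.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    """X: (seq, d_model). Wq/Wk/Wv/Wo: (d_model, d_model). h: number of heads."""
    seq, d_model = X.shape
    d_k = d_model // h
    # One big projection each for Q, K, V, then reshape into h heads of width d_k.
    Q = (X @ Wq).reshape(seq, h, d_k).transpose(1, 0, 2)  # (h, seq, d_k)
    K = (X @ Wk).reshape(seq, h, d_k).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (h, seq, seq)
    heads = softmax(scores) @ V                           # (h, seq, d_k)
    # The "combine the analysts' notes" step: concatenate all heads, then project.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
d_model, h, seq = 512, 8, 6
X = rng.standard_normal((seq, d_model))
W = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
out = multi_head_attention(X, *W, h=h)
print(out.shape)  # (6, 512)
```

Note that the output has the same shape as the input, which is what lets transformer layers stack this operation dozens of times.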

According to Dive into Deep Learning, this design keeps computational cost roughly the same as single-head attention because each head operates on a smaller dimension (d_model divided by h), so the total number of parameters stays constant.
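That parameter-count claim is easy to verify with arithmetic over the projection shapes (bias terms ignored for simplicity):

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 dimensions per head

# Multi-head: Q, K, V each project d_model -> d_k in every one of the h heads,
# plus one output projection from the concatenated width (h * d_k) back to d_model.
multi_head = 3 * h * (d_model * d_k) + (h * d_k) * d_model

# Single full-width head: Q, K, V, and output projections, all d_model x d_model.
single_head = 4 * d_model * d_model

print(multi_head == single_head)  # True: same budget, split across heads
```

Because h * d_k equals d_model by construction, the totals match exactly: multi-head attention trades width per head for diversity across heads at no extra parameter cost.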

How It’s Used in Practice

When you type a prompt into ChatGPT, Claude, or any transformer-based tool, multi-head attention runs at every layer of the model. It is the core computation that determines how tokens interact. In a coding assistant like Cursor, multi-head attention helps the model understand that a variable declared thirty lines earlier is the same one referenced in the current function — one head tracks variable scope, another tracks syntax structure, another tracks data flow patterns.

Modern models have evolved the original design. Grouped Query Attention (GQA) and Multi-Query Attention (MQA) are variants that share key-value projections across groups of heads. This reduces memory usage during inference without losing much quality, which is why these variants appear in recent open-source models.
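The memory saving comes from the key-value cache, whose size scales with the number of KV heads rather than the number of query heads. A back-of-the-envelope calculation with illustrative numbers (these are assumptions for the sketch, not the figures of any specific model):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    # 2x for keys and values; bytes_per_value=2 assumes fp16 storage.
    return 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads, head dim 128, 8192-token context.
full_mha = kv_cache_bytes(8192, 32, n_kv_heads=32, d_head=128)  # one KV pair per head
gqa      = kv_cache_bytes(8192, 32, n_kv_heads=8,  d_head=128)  # 8 shared KV groups
mqa      = kv_cache_bytes(8192, 32, n_kv_heads=1,  d_head=128)  # single shared KV head

print(full_mha // 2**20, gqa // 2**20, mqa // 2**20)  # MiB: 4096 1024 128
```

Under these assumptions, GQA cuts the cache to a quarter of full MHA and MQA to a thirty-second, which is why the variants matter so much for serving long contexts.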

Pro Tip: If you are evaluating open-source models and see “GQA” in the spec sheet, that means the model uses a memory-efficient variant of multi-head attention. It will generally run faster and use less GPU memory at inference time compared to full MHA — a practical advantage when you are choosing between models for deployment.

When to Use / When Not

Use for:
- Building or fine-tuning a transformer-based model
- Tasks requiring understanding of multiple relationship types (translation, summarization)
- Replacing recurrent layers in a sequence model

Avoid for:
- Processing very long sequences where memory is tight
- Simple pattern matching on short fixed inputs
- Real-time applications on extremely constrained hardware

Common Misconception

Myth: More attention heads always produce better results, so bigger is always better. Reality: Each head operates on a smaller slice of the model dimension. Past a certain point, adding heads shrinks each subspace so much that individual heads lose the capacity to learn meaningful patterns. The original 8-head design was chosen because it balanced diversity of attention patterns with sufficient per-head capacity. Research has shown that some heads in trained models are redundant and can be pruned without hurting performance.

One Sentence to Remember

Multi-head attention lets a model ask several different questions about the same input at the same time, then merge the answers — and understanding this mechanism is the key to grasping why transformers handle language so well.

FAQ

Q: How does multi-head attention differ from single-head attention? A: Single-head attention computes one set of relationships between tokens. Multi-head splits this into parallel independent computations, each learning different patterns, then combines them for a richer representation.

Q: Does multi-head attention increase the number of model parameters? A: Not significantly. Each head works on a proportionally smaller dimension, so the total parameter count stays roughly equal to a single-head version with the same model dimension.

Q: What are GQA and MQA? A: Grouped Query Attention and Multi-Query Attention are efficiency variants that share key-value projections across heads, reducing memory during inference while preserving most of the quality benefits of full multi-head attention.

Expert Takes

Multi-head attention is a factorization strategy. Instead of learning one large attention matrix, you decompose it into h smaller matrices, each capturing a distinct subspace of token relationships. The critical insight is that the total parameter budget remains constant — you trade breadth per head for diversity across heads. This decomposition is what gives transformers their ability to represent multiple linguistic phenomena simultaneously within the same layer.

When you write a prompt that references information from three paragraphs back, multi-head attention is the reason the model can track that dependency. Each head in the stack acts like a separate routing channel — one maps co-reference, another tracks syntactic structure, another handles semantic similarity. For anyone building context-aware applications, understanding this parallel routing explains both the strengths and the token-limit constraints of current models.

Multi-head attention is the single architectural decision that made large language models commercially viable. Before transformers, sequence models processed tokens one at a time and struggled to scale. Parallel attention heads allowed training on massive datasets with GPU clusters, which is what turned language models from research curiosities into products that generate real revenue. Every AI tool you evaluate today runs on this mechanism.

The fact that attention heads specialize without explicit instruction raises a question worth sitting with: we design the architecture, but we do not choose what each head learns. Some heads in trained models attend to positional patterns, others to syntactic roles, and some appear to do nothing useful at all. This emergent specialization means we build systems whose internal reasoning pathways we can observe but not fully predict or control.