Top K Routing

Also known as: Top-K Gating, Top-K Expert Selection, K-Expert Routing

Top-k routing is a gating mechanism in Mixture of Experts (MoE) models: a learned router scores every available expert for each input token, then routes computation to only the k highest-scoring experts while keeping the rest inactive, so only a fraction of the model's total parameters activates per step.

What It Is

When AI models grow to hundreds of billions of parameters, running every parameter for every input becomes prohibitively expensive. Top-k routing solves this by acting like an air traffic controller for neural network computation: it examines each incoming token, scores every available expert subnetwork, and directs the token to only the top k experts best qualified to handle it. The remaining experts stay idle for that token.

The routing mechanism works through a small learned neural network called a router (or gate). For each token, the router produces a raw score for every expert. According to Shazeer et al., the standard formula is G(x) = Softmax(KeepTopK(H(x), k)), where H(x) is the raw router output and KeepTopK masks all but the k largest values (setting them to −∞, so the softmax assigns them exactly zero weight). The selected experts each process the token independently, and their outputs are combined as a weighted sum using the router's softmax scores as weights.
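As a concrete illustration, here is a minimal pure-Python sketch of that formula; the function name and the toy logits are hypothetical, not taken from any cited implementation:

```python
import math

def top_k_gate(logits, k):
    """Sketch of G(x) = Softmax(KeepTopK(H(x), k)).
    logits: the raw router outputs H(x), one score per expert."""
    top = sorted(range(len(logits)), key=lambda i: logits[i])[-k:]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}  # stable softmax over the top k
    z = sum(exps.values())
    # Experts outside the top k get exactly zero weight (the KeepTopK mask)
    return [exps.get(i, 0.0) / z for i in range(len(logits))]

# Hypothetical router logits for one token over four experts
gates = top_k_gate([1.2, 0.3, 2.5, -0.7], k=2)
# Only experts 0 and 2 receive nonzero weight, and the weights sum to 1
```

Note that the masking happens before the softmax, which is why the unselected experts contribute nothing to the normalization.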

The value of k determines how many experts handle each token. According to Fedus et al., top-1 routing (used in the Switch Transformer) sends each token to a single expert for maximum efficiency. Top-2 routing, which according to NVIDIA Blog is the approach used in Mixtral with 8 experts and 2 selected per token, balances efficiency with quality by blending two expert opinions. More recent designs push further — according to DeepSeek Technical Report, DeepSeek-V3 routes to 8 out of 256 experts plus one shared expert, demonstrating that the strategy scales to much larger expert pools.
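The weighted-sum combination of the selected experts' outputs can be sketched with toy experts; the scaling functions and gate values below are purely illustrative (real experts are feed-forward sublayers):

```python
def moe_combine(x, experts, gates):
    """Blend the selected experts' outputs as a weighted sum,
    using the router's gate values as weights."""
    out = [0.0] * len(x)
    for expert, g in zip(experts, gates):
        if g > 0.0:  # experts with zero gate weight never run
            y = expert(x)
            out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Toy experts that just scale the input vector
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gates = [0.0, 0.7, 0.0, 0.3]  # hypothetical top-2 gate weights
out = moe_combine([1.0, 1.0], experts, gates)  # 0.7*(2x) + 0.3*(4x) ≈ [2.6, 2.6]
```

With top-1 routing the loop would run a single expert and the "blend" degenerates to that expert's output alone, which is where the Switch Transformer's efficiency comes from.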

How It’s Used in Practice

For most people encountering this term, top-k routing is the reason modern large language models can exist at their current scale without requiring proportionally more computing power. When you send a prompt to a model built on Mixture of Experts architecture, each token in your prompt passes through the router, which decides which subset of the model’s expertise handles it. A token about code might activate programming-specialized experts, while a token about medical terminology might activate different ones.

This is why some MoE-based models can have far more total parameters than dense models yet respond at comparable speed — each token only triggers a fraction of the full network. The trade-off is that the router itself must be trained well. Poor routing leads to some experts being overloaded while others sit unused, a problem called load imbalance that researchers address through auxiliary loss terms — extra penalties added during training that discourage the router from always picking the same few experts.
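One common form of such an auxiliary loss, in the style of the Switch Transformer's load-balancing term (N · Σᵢ fᵢ · Pᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is the mean router probability for expert i), can be sketched as follows; the function name and toy inputs are illustrative:

```python
def load_balancing_loss(router_probs, assignments, num_experts):
    """Auxiliary loss that is smallest when tokens spread evenly:
    f_i = fraction of tokens dispatched to expert i,
    P_i = mean router probability assigned to expert i."""
    n_tokens = len(assignments)
    f = [assignments.count(i) / n_tokens for i in range(num_experts)]
    p = [sum(row[i] for row in router_probs) / n_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Four tokens, four experts; each row is one token's router distribution
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # router always prefers expert 0
load_balancing_loss(balanced, [0, 1, 2, 3], 4)   # ≈ 1.0: even spread
load_balancing_loss(collapsed, [0, 0, 0, 0], 4)  # ≈ 2.8: imbalance penalized
```

Adding a small multiple of this term to the training loss nudges the router away from collapsing onto a few favorite experts.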

Pro Tip: When evaluating MoE-based models, pay attention to the k value and total expert count. A model with top-2 routing over 8 experts activates 25% of its expert capacity per token. A model with top-8 over 256 experts activates about 3%. Lower activation ratios generally mean better cost efficiency at inference time, but routing quality matters just as much as the ratio itself.
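The arithmetic in the tip is easy to check directly (the helper name is just illustrative):

```python
def activation_ratio(k, num_experts):
    """Fraction of the routed experts that activate per token."""
    return k / num_experts

activation_ratio(2, 8)    # 0.25: top-2 over 8 experts
activation_ratio(8, 256)  # 0.03125, i.e. about 3%: top-8 over 256 experts
```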

When to Use / When Not

Use when:
- Building a model that needs to scale parameters without scaling compute linearly
- Serving a model where different inputs benefit from specialized processing paths
- Designing an architecture where you want to add capacity by adding new experts later

Avoid when:
- Training a small model where every parameter must contribute to every prediction
- Deploying on hardware whose memory cannot hold all experts at once (every expert must stay resident even though only k compute per token)
- Working with tasks that require every parameter to process every input uniformly

Common Misconception

Myth: Top-k routing means each expert permanently specializes in one topic, like a team where one person handles only math and another handles only language.

Reality: Expert specialization emerges through training but remains fluid and overlapping. Experts don’t carry fixed labels. The router learns soft preferences — a particular expert might handle syntax-heavy tokens more often, but it can still be selected for other tasks. The specialization is statistical, not categorical, and it shifts as training progresses.

One Sentence to Remember

Top-k routing is the decision layer that picks which few experts process each token in an MoE model, turning a massive network into an efficient one where only the most relevant parameters activate for any given input.

FAQ

Q: What is the difference between top-1 and top-2 routing in MoE models? A: Top-1 sends each token to one expert for maximum speed. Top-2 blends two expert outputs per token, improving quality at the cost of roughly double the expert computation.

Q: Does top-k routing decide which expert handles which topic? A: Not directly. The router learns scoring patterns during training. Specialization emerges as a side effect — no human assigns topics to experts beforehand.

Q: Can the k value change during inference? A: Typically no. The k value is fixed during training and stays the same at inference. Changing k after training would alter the output distribution the model was optimized for.

Expert Takes

Top-k routing reduces per-token computational complexity across the expert layer from proportional-to-all-experts down to proportional-to-k. The softmax gating function produces a differentiable selection mechanism, meaning the router trains end-to-end through standard backpropagation. The mathematical elegance here is that zeroing out non-top-k logits creates sparsity while preserving gradient flow through the selected paths. Sparse activation with dense training — that’s the core trick.

If you’re working with an MoE-based model through an API, the routing happens invisibly — you don’t configure k or choose experts. But understanding top-k routing helps you reason about why the same model might produce subtly different response qualities across different topics. It also explains certain latency patterns: token-level routing means response time depends partly on expert availability and load distribution, not just sequence length alone.

The k value is a strategic lever. Lower k means cheaper inference per token. Higher k means richer expert blending per token. Every major MoE deployment makes this trade-off differently, and whoever gets the ratio right ships models that cost less to run without sacrificing output quality. Selecting k is not just an engineering decision — it directly shapes the cost-to-quality curve of the entire serving infrastructure.

Routing opacity raises a question worth sitting with. When a model routes a medical question to certain experts and a legal question to others, the selection criteria are learned, not designed. Nobody specified which expert should handle health advice. The routing network optimized for loss reduction, not for domain competence. That gap between what the optimization target measures and what real-world stakes demand deserves closer scrutiny from anyone deploying these models.