Mixture of Experts
A neural network architecture that scales model capacity efficiently by routing each input through only a small subset of specialized sub-networks ("experts"), keeping compute costs manageable even as total model size grows.
Why it matters
MoE has emerged as the dominant architecture for pushing AI capabilities forward without proportionally increasing costs. It is the reason models like DeepSeek can compete with systems that cost 10x more to train. For practitioners evaluating models, understanding MoE explains why some models have surprisingly good performance relative to their inference cost — and why their memory requirements might be higher than expected. As the industry moves toward ever-larger models, MoE is the scaling strategy that makes frontier AI economically viable.
How Routing Works
In a standard transformer, every input token passes through every layer and every parameter. In an MoE model, each transformer layer contains multiple parallel "expert" sub-networks — typically feed-forward networks — and a lightweight gating network (router) that decides which experts handle each token.
- Gating network — a small neural network that takes each token's representation and outputs a probability distribution across all available experts
- Sparse activation — only the top-K experts (usually 1 or 2) are activated for each token, meaning most of the model's parameters sit idle for any given input
- Top-K selection — the router picks the highest-scoring experts, runs the token through them, and combines the outputs using the gating weights
This sparse activation is the key insight: the model can have hundreds of billions of total parameters but only use a fraction of them per forward pass, keeping inference costs proportional to the active parameter count rather than the total.
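The routing mechanics above can be sketched in a few lines. This is a minimal, self-contained NumPy illustration, not a production MoE layer: each "expert" is reduced to a single linear map, and the weights and dimensions are made-up placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # toy sizes for illustration

# Hypothetical parameters: a tiny router and one linear map per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(token):
    """Route one token (shape: d_model) through its top-K experts."""
    logits = token @ router_w                 # gating network scores every expert
    probs = softmax(logits)                   # probability distribution over experts
    top = np.argsort(probs)[-top_k:]          # top-K selection
    gates = probs[top] / probs[top].sum()     # renormalize the selected gate weights
    # Sparse activation: only the K selected experts run; the rest stay idle.
    return sum(g * (token @ expert_ws[i]) for i, g in zip(top, gates))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)
```

Note that the output is a gate-weighted combination of only K expert outputs, so the per-token compute scales with K, not with the total number of experts.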
Scaling Economics
MoE architectures fundamentally change the cost equation for AI. A model like DeepSeek-V3 has 671 billion total parameters but activates only 37 billion per token — giving it the knowledge capacity of a much larger model at roughly the inference cost of a mid-size one.
- Total vs. active parameters — MoE models are described by both numbers (e.g., "671B total / 37B active"), and the active count determines compute cost
- Training efficiency — MoE models can be trained on fewer FLOPs than dense models of equivalent quality, because each expert specializes and learns more efficiently within its domain
- Memory trade-off — all expert parameters must be loaded into memory even though most are idle, which means MoE models need more VRAM than their active parameter count suggests
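The cost asymmetry in the bullets above is easy to see with back-of-envelope arithmetic, using the DeepSeek-V3 parameter counts quoted earlier. The FP8 weight assumption here is illustrative (1 byte per parameter), not a claim about any specific deployment.

```python
# Parameter counts from the text: 671B total, 37B active per token.
total_params = 671e9
active_params = 37e9

# Compute cost per token scales with the ACTIVE count...
active_fraction = active_params / total_params
print(f"Active fraction: {active_fraction:.1%}")   # roughly 5.5% of parameters per token

# ...but memory must hold ALL parameters. Assuming 1 byte/param (e.g., FP8 weights):
weight_memory_gb = total_params * 1 / 1e9
print(f"Weights alone: ~{weight_memory_gb:.0f} GB")
```

So a model that computes like a ~37B dense model still needs the memory footprint of a 671B one, which is the VRAM trade-off the last bullet describes.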
Key Examples
- Mixtral 8x7B (Mistral) — eight expert feed-forward networks per layer with 2 active per token (roughly 47B total / 13B active parameters, since attention layers are shared), delivering performance competitive with much larger dense models
- DeepSeek-V3 — the model that proved MoE can compete with frontier labs at a fraction of training cost, shaking up industry cost assumptions
- GPT-4 (rumored) — widely reported to use an MoE architecture, though OpenAI has never confirmed the specifics