Mixture of Experts
A neural network architecture that scales model capacity efficiently by routing each input through only a small subset of specialized sub-networks ("experts"), keeping compute costs manageable even as total model size grows.
Why it matters
MoE has emerged as the dominant architecture for pushing AI capabilities forward without proportionally increasing costs. It is the reason models like DeepSeek can compete with systems that cost 10x more to train. For practitioners evaluating models, understanding MoE explains why some models have surprisingly good performance relative to their inference cost — and why their memory requirements might be higher than expected. As the industry moves toward ever-larger models, MoE is the scaling strategy that makes frontier AI economically viable.
How Routing Works
In a standard transformer, every input token passes through every layer and every parameter. In an MoE model, each transformer layer contains multiple parallel "expert" sub-networks — typically feed-forward networks — and a lightweight gating network (router) that decides which experts handle each token.
- Gating network — a small neural network that takes each token's representation and outputs a probability distribution across all available experts
- Sparse activation — only the top-K experts (usually 1 or 2) are activated for each token, meaning most of the model's parameters sit idle for any given input
- Top-K selection — the router picks the highest-scoring experts, runs the token through them, and combines the outputs using the gating weights
This sparse activation is the key insight: the model can have hundreds of billions of total parameters but only use a fraction of them per forward pass, keeping inference costs proportional to the active parameter count rather than the total.
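The routing mechanics above can be sketched in a few lines. This is a minimal, self-contained NumPy illustration, not a production MoE layer: each "expert" is reduced to a single linear map, and the weights and dimensions are made-up placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2  # toy sizes for illustration

# Hypothetical parameters: a tiny router and one linear map per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_ws = rng.normal(size=(n_experts, d_model, d_model))

def moe_forward(token):
    """Route one token (shape: d_model) through its top-K experts."""
    logits = token @ router_w                 # gating network scores every expert
    probs = softmax(logits)                   # probability distribution over experts
    top = np.argsort(probs)[-top_k:]          # top-K selection
    gates = probs[top] / probs[top].sum()     # renormalize the selected gate weights
    # Sparse activation: only the K selected experts run; the rest stay idle.
    return sum(g * (token @ expert_ws[i]) for i, g in zip(top, gates))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)
```

Note that the output is a gate-weighted combination of only K expert outputs, so the per-token compute scales with K, not with the total number of experts.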
Scaling Economics
MoE architectures fundamentally change the cost equation for AI. A model like DeepSeek-V3 has 671 billion total parameters but activates only 37 billion per token — giving it the knowledge capacity of a much larger model at roughly the inference cost of a mid-size one.
- Total vs. active parameters — MoE models are described by both numbers (e.g., "671B total / 37B active"), and the active count determines compute cost
- Training efficiency — MoE models can be trained on fewer FLOPs than dense models of equivalent quality, because each expert specializes and learns more efficiently within its domain
- Memory trade-off — all expert parameters must be loaded into memory even though most are idle, which means MoE models need more VRAM than their active parameter count suggests
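The cost asymmetry in the bullets above is easy to see with back-of-envelope arithmetic, using the DeepSeek-V3 parameter counts quoted earlier. The FP8 weight assumption here is illustrative (1 byte per parameter), not a claim about any specific deployment.

```python
# Parameter counts from the text: 671B total, 37B active per token.
total_params = 671e9
active_params = 37e9

# Compute cost per token scales with the ACTIVE count...
active_fraction = active_params / total_params
print(f"Active fraction: {active_fraction:.1%}")   # roughly 5.5% of parameters per token

# ...but memory must hold ALL parameters. Assuming 1 byte/param (e.g., FP8 weights):
weight_memory_gb = total_params * 1 / 1e9
print(f"Weights alone: ~{weight_memory_gb:.0f} GB")
```

So a model that computes like a ~37B dense model still needs the memory footprint of a 671B one, which is the VRAM trade-off the last bullet describes.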
Key Examples
- Mixtral 8x7B (Mistral) — eight expert feed-forward networks per layer with 2 active per token (roughly 47B total / 13B active parameters, since attention layers are shared), delivering performance competitive with much larger dense models
- DeepSeek-V3 — the model that proved MoE can compete with frontier labs at a fraction of training cost, shaking up industry cost assumptions
- GPT-4 (rumored) — widely reported to use an MoE architecture, though OpenAI has never confirmed the specifics