Transformer Architecture
A neural network architecture that powers modern AI by processing entire input sequences simultaneously through an attention mechanism, rather than reading them word by word.
Why It Matters
Every major AI model you interact with today — Claude, GPT, Gemini, Llama — is built on the transformer architecture. Understanding how self-attention and parallel processing work gives you a mental model for why these systems have the capabilities and limitations they do: why long inputs are expensive to process (attention compares every token against every other, so cost grows quadratically with sequence length), why context windows matter, and why scaling compute translates into better performance. If you're evaluating AI tools or building on top of them, the transformer is the foundational concept everything else builds on.
How It Works
The transformer's core innovation is self-attention — a mechanism that lets the model weigh the importance of every part of an input relative to every other part, all at once. For each token in a sequence, the model computes three vectors:
- Query — what this token is looking for
- Key — what this token offers to others
- Value — the actual information this token carries
By computing dot products between queries and keys, the model builds an attention map that determines how much each token should attend to every other token. This happens across multiple "heads" in parallel (multi-head attention), allowing the model to capture different types of relationships — syntactic, semantic, positional — simultaneously.
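The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any specific model's code: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders, and the division by the square root of the key dimension is the standard scaling used to keep the softmax from saturating.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Dot products between queries and keys form the attention map;
    # each row says how much one token attends to every other token.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # rows sum to 1
    # Output: a weighted mix of value vectors for each token.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))           # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention simply runs several of these computations in parallel with independent projection matrices and concatenates the outputs; note that every row of `weights` is computed at once, with no token-by-token loop — the parallelism discussed below.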
Crucially, this parallel processing replaces the sequential bottleneck of earlier architectures like RNNs and LSTMs, which had to process tokens one at a time. That parallelism is what made transformers practical to train on massive datasets using GPU clusters.
Key Variations
The original transformer had both an encoder and a decoder, but the field has since split into three dominant patterns:
- Encoder-only (BERT, RoBERTa) — processes the full input bidirectionally, best suited for classification, search, and understanding tasks
- Decoder-only (GPT, Claude, Llama) — generates text left-to-right, the dominant architecture for modern large language models and chatbots
- Encoder-decoder (T5, BART) — combines both, often used for translation, summarization, and structured generation tasks
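The difference between bidirectional (encoder-only) and left-to-right (decoder-only) processing comes down to a mask applied to the attention scores. A small sketch of the standard causal mask, using illustrative names rather than any particular library's API:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal: position i may attend only to positions 0..i,
    # never to future tokens. This constraint is what lets decoder-only
    # models generate text strictly left-to-right.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = causal_mask(4)
scores = np.zeros((4, 4))          # stand-in for query–key dot products
scores[mask] = -np.inf             # masked entries become zero weight after softmax
```

Encoder-only models skip this mask entirely, so every token sees the whole input in both directions, which is why they suit understanding tasks rather than generation.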
Historical Context
The transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain. It was originally designed to improve machine translation, but researchers quickly discovered its generality. By 2019, GPT-2 demonstrated that scaling decoder-only transformers produced emergent capabilities, and the architecture has since become the foundation for virtually every frontier AI system — from language models to image generators to protein structure predictors.