Transformer Architecture
A neural network architecture that powers modern AI by processing entire input sequences simultaneously through an attention mechanism, rather than reading them word by word.
Why It Matters
Every major AI model you interact with today — Claude, GPT, Gemini, Llama — is built on the transformer architecture. Understanding how self-attention and parallel processing work gives you a mental model for why these systems have the capabilities and limitations they do: why long inputs are expensive to process (attention compares every token against every other, so cost grows quadratically with sequence length), why context windows matter, and why scaling compute translates into better performance. If you're evaluating AI tools or building on top of them, the transformer is the foundational concept everything else builds on.
How It Works
The transformer's core innovation is self-attention — a mechanism that lets the model weigh the importance of every part of an input relative to every other part, all at once. For each token in a sequence, the model computes three vectors:
- Query — what this token is looking for
- Key — what this token offers to others
- Value — the actual information this token carries
By computing dot products between queries and keys, the model builds an attention map that determines how much each token should attend to every other token. This happens across multiple "heads" in parallel (multi-head attention), allowing the model to capture different types of relationships — syntactic, semantic, positional — simultaneously.
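The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not any specific model's code: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders, and the division by the square root of the key dimension is the standard scaling used to keep the softmax from saturating.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each token embedding into query, key, and value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Dot products between queries and keys form the attention map;
    # each row says how much one token attends to every other token.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # rows sum to 1
    # Output: a weighted mix of value vectors for each token.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))           # 4 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Multi-head attention simply runs several of these computations in parallel with independent projection matrices and concatenates the outputs; note that every row of `weights` is computed at once, with no token-by-token loop — the parallelism discussed below.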
Crucially, this parallel processing replaces the sequential bottleneck of earlier architectures like RNNs and LSTMs, which had to process tokens one at a time. That parallelism is what made transformers practical to train on massive datasets using GPU clusters.
Key Variations
The original transformer had both an encoder and a decoder, but the field has since split into three dominant patterns:
- Encoder-only (BERT, RoBERTa) — processes the full input bidirectionally, best suited for classification, search, and understanding tasks
- Decoder-only (GPT, Claude, Llama) — generates text left-to-right, the dominant architecture for modern large language models and chatbots
- Encoder-decoder (T5, BART) — combines both, often used for translation, summarization, and structured generation tasks
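The difference between bidirectional (encoder-only) and left-to-right (decoder-only) processing comes down to a mask applied to the attention scores. A small sketch of the standard causal mask, using illustrative names rather than any particular library's API:

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal: position i may attend only to positions 0..i,
    # never to future tokens. This constraint is what lets decoder-only
    # models generate text strictly left-to-right.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = causal_mask(4)
scores = np.zeros((4, 4))          # stand-in for query–key dot products
scores[mask] = -np.inf             # masked entries become zero weight after softmax
```

Encoder-only models skip this mask entirely, so every token sees the whole input in both directions, which is why they suit understanding tasks rather than generation.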
Historical Context
The transformer was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google Brain. It was originally designed to improve machine translation, but researchers quickly discovered its generality. By 2019, GPT-2 demonstrated that scaling decoder-only transformers produced emergent capabilities, and the architecture has since become the foundation for virtually every frontier AI system — from language models to image generators to protein structure predictors.