Foundations|Module 3 of 8|25 min|Beginner

How LLMs Work: The Engine Under Every AI Tool You Use

An LLM does one thing: predict the next word. That single idea explains hallucination, creativity, prompt sensitivity, and every other behaviour that surprises AI users.


What you'll learn

01

Explain next-token prediction and why it matters for how you use AI

02

Understand what tokens and context windows are and why they affect your results

03

Recognise why LLMs are confidently wrong, and why that is a feature of the architecture, not a fixable bug

You asked ChatGPT a question and got a confident, detailed answer. Then you asked Claude the same question and got a different confident, detailed answer. Neither hesitated. Neither said “I’m not sure.” Both sounded like they knew exactly what they were talking about.

So which one was right? And how would you know?

Here’s the bit most people skip past: neither model looked anything up. Neither consulted a database. Neither searched the internet. Something else happened entirely. This module explains that “something else,” and once you get it, everything about how AI tools behave starts making more sense.

In Module 2, we established that machine learning discovers patterns from data rather than following programmed rules. Neural networks stack layers of pattern detectors, building from simple to complex. This module looks at what happens when you point that approach specifically at language, and train it at a scale nobody anticipated would work as well as it does.

Next-token prediction: the one idea that explains (almost) everything

You type “The capital of New Zealand is” into an AI tool. It responds: “Wellington.”

The model doesn’t know geography. What it knows is that the token “Wellington” followed the sequence “The capital of New Zealand is” more often than any other token in the text it was trained on. It predicted the statistically most likely next word. That’s it. That’s the whole thing.

An LLM does one operation: given all the text so far, predict the most probable next token. A token is roughly a word or piece of a word. “Understanding” might be one token, while “unbelievable” gets split into “un,” “believ,” “able.” At each step, the model calculates probability scores for every token in its vocabulary. GPT-2’s vocabulary has 50,257 tokens. Every single prediction is the model choosing from those 50,000+ options.
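The subword splitting described above can be sketched in a few lines. Real tokenizers use byte-pair encodings learned from data; the tiny vocabulary and greedy longest-match rule below are invented purely for illustration.

```python
# Toy greedy longest-match subword tokenizer. A simplified stand-in for
# BPE-style tokenizers; the vocabulary here is hand-picked for illustration.
def tokenize(text, vocab):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at position i that is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"understanding", "un", "believ", "able"}
print(tokenize("understanding", vocab))  # ['understanding'] — one token
print(tokenize("unbelievable", vocab))   # ['un', 'believ', 'able'] — three tokens
```

The same word can be one token or several depending on the vocabulary, which is why token counts never line up exactly with word counts.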

Once it picks a token, it adds that token to the sequence and predicts the next one. Then the next. Then the next. Your phone’s autocomplete does something similar, just at a much simpler level. An LLM is autocomplete trained on a significant chunk of the internet’s text, with billions of parameters tuned to make those predictions sophisticated enough to write essays, summarise reports, and generate working code.
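The loop just described can be sketched in miniature. The probability table below is invented for illustration; a real model computes these scores with a neural network over a vocabulary of tens of thousands of tokens.

```python
# Minimal sketch of the generation loop: score the possible next tokens,
# pick the most likely one, append it, repeat. The lookup table stands in
# for a real model's learned parameters.
NEXT_TOKEN_PROBS = {
    "The best way to improve": {"your": 0.6, "the": 0.2, "a": 0.2},
    "The best way to improve your": {"writing": 0.5, "code": 0.3, "health": 0.2},
    "The best way to improve your writing": {"is": 0.9, "was": 0.1},
}

def generate(prompt, steps):
    text = prompt
    for _ in range(steps):
        probs = NEXT_TOKEN_PROBS.get(text)
        if probs is None:
            break
        # Greedy decoding: take the single highest-probability token.
        best = max(probs, key=probs.get)
        text = text + " " + best
    return text

print(generate("The best way to improve", 3))
# "The best way to improve your writing is"
```

Note that the model never decides what to "say" in advance. Each word is chosen one step at a time, conditioned only on what came before.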

Key Term: Next-Token Prediction — The core task an LLM is trained to do: given a sequence of tokens, predict what comes next. Every capability of LLMs (writing, reasoning, coding) emerges from this single training objective applied at massive scale. See the Glossary for details.

This single mechanism explains almost every behaviour that surprises people.

Hallucination? The model predicts plausible text, not true text. If “Wellington” and a fabricated fact both score high on the probability distribution, the model has no way to prefer the true one.

Creativity? When sampling from probability distributions, the model doesn’t always pick the single most likely token. A setting called temperature controls how much randomness gets introduced: higher temperature means more variation and more creative, less predictable output.
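Temperature works by reshaping the probability distribution before a token is drawn. The sketch below uses invented raw scores (logits); a real model produces one score per vocabulary token.

```python
import math
import random

# Sketch of temperature sampling. Dividing logits by the temperature before
# the softmax sharpens the distribution (T < 1) or flattens it (T > 1).
def sample(logits, temperature=1.0, rng=random):
    scaled = {tok: s / temperature for tok, s in logits.items()}
    m = max(scaled.values())
    # Softmax (shifted by the max for numerical stability).
    exp = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}
    # Draw one token according to those probabilities.
    r = rng.random()
    cumulative = 0.0
    for tok, p in probs.items():
        cumulative += p
        if r <= cumulative:
            return tok
    return tok  # Fallback for floating-point rounding at the boundary.

logits = {"Wellington": 4.0, "Auckland": 2.0, "Sydney": 0.5}
# At very low temperature the highest-scoring token wins almost every time;
# at high temperature the lower-scoring tokens get drawn far more often.
print(sample(logits, temperature=0.1))
print(sample(logits, temperature=2.0))
```

This is why the same prompt can produce different answers on different runs: the randomness is deliberate, and the temperature knob controls how much of it you get.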

Prompt sensitivity? Different input context creates a different probability landscape. Rewording your question changes which patterns the model draws on, which changes the prediction at every step.

Identical confidence whether right or wrong? The model doesn’t have a confidence mechanism. It predicts the next likely token with the same fluency regardless of whether the prediction is factually sound. That consistent, assured tone is a product of training, not an indicator of accuracy.


Visualising next-token prediction: a prompt enters, the model outputs probability scores across its vocabulary, the highest-probability token is selected, appended to the sequence, and the process repeats. For example: “The best way to improve” → “your” → “writing” → “is”.

Tokens and context windows: the practical constraints

You paste a 50-page report into Claude and ask for a summary. The summary of the opening sections is sharp and specific. The middle of the report gets vague generalisations. The closing pages are covered well again. That pattern isn’t random.

Everything the model can “see” at once (your prompt, any system instructions, uploaded documents, conversation history, and the response it’s generating) has to fit inside a fixed-size window measured in tokens. This is the context window, and it’s the model’s entire working memory.

Current context window sizes vary widely. Claude Sonnet processes up to 200,000 tokens (roughly 150,000 words). GPT-5 handles 400,000. Gemini 3 Pro stretches to 2 million. Those numbers sound enormous. But there’s a catch.
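Because everything shares one window, it helps to estimate your token budget before sending a request. The sketch below uses the common rule of thumb of roughly four characters per token for English; real tokenizers (available as libraries from most providers) give exact counts.

```python
# Rough token-budget check. The 4-characters-per-token ratio is a widely
# used heuristic for English text, not an exact measurement.
def estimate_tokens(text):
    return len(text) // 4

def fits_in_window(system_prompt, documents, user_prompt,
                   window=200_000, reserved_for_response=4_000):
    used = estimate_tokens(system_prompt) + estimate_tokens(user_prompt)
    used += sum(estimate_tokens(d) for d in documents)
    # The model's reply shares the same window, so reserve room for it.
    return used + reserved_for_response <= window

# A ~1,000,000-character document is roughly 250,000 tokens: too big
# for a 200,000-token window even before the response is counted.
print(fits_in_window("You are helpful.", ["x" * 1_000_000], "Summarise this."))
# False
```

The `reserved_for_response` term is the part people forget: a prompt that "just fits" leaves the model no room to answer.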

Advertised capacity and effective capacity aren’t the same thing. Research consistently shows that models handle information at the beginning and end of context well (85-95% accuracy), while information in the middle degrades to 76-82% accuracy. This is called the “Lost in the Middle” problem, and it’s been documented across multiple model families.

It gets worse. A model advertising 200,000 tokens typically becomes unreliable somewhere around 130,000, roughly 60-70% of the stated maximum. The drop isn’t gradual. Performance holds steady and then falls off sharply.

Tip: Put your most important information at the start of your prompt, not buried in the middle. If you’re uploading multiple documents, the first and last get more reliable attention than the ones in between.
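One way to act on that tip programmatically: given documents already ranked by relevance, interleave them so the strongest land at the start and end of the context and the weakest sit in the middle. This is a known mitigation pattern, sketched here in a minimal form; it reduces, not removes, the lost-in-the-middle effect.

```python
# Reorder relevance-ranked documents so the most relevant occupy the
# edges of the context window, where models attend most reliably.
def reorder_for_edges(docs_by_relevance):
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: best doc at the front, second-best at the back,
        # working inwards so the weakest end up in the middle.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

docs = ["doc1 (most relevant)", "doc2", "doc3", "doc4", "doc5 (least relevant)"]
print(reorder_for_edges(docs))
# ['doc1 (most relevant)', 'doc3', 'doc5 (least relevant)', 'doc4', 'doc2']
```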

The practical implication: more context is not always better. A model with 200K tokens and clean, focused context will outperform a model with 2 million tokens and noisy, unfocused context. This is one reason RAG systems (which we’ll cover in Module 6) still matter even as context windows grow. They help select the right information to put in front of the model, not just all of it.

Key Term: Context Window — The maximum amount of text an LLM can process in a single interaction, measured in tokens. Think of it as the model’s working memory. Everything (your prompt, system instructions, documents, and the response) has to fit. See the Glossary for details.


Visualising a context window: a fixed-size container with prompt, system instructions, documents, and response all competing for space, and an accuracy curve that is high at the edges and dips in the middle.

Try This: Ask an AI model about something very specific and obscure from your professional domain, something that wouldn’t appear often in training data. Notice how it responds with the same confident tone it uses for common knowledge. Then ask it to cite its sources. The gap between confidence and verifiability is next-token prediction in action.

The training pipeline: pre-training, fine-tuning, RLHF

If you’ve noticed that ChatGPT and Claude feel different, that one is more cautious while the other engages more freely, the explanation isn’t the underlying technology. It’s what happened after the base model was built.

LLM training follows three stages, each shaping the model’s behaviour in different ways.

Stage 1: Pre-training, or reading the internet.

The model processes trillions of tokens of text: web pages, books, academic papers, code repositories, forums. At this scale, training runs cost millions of dollars in compute. The result is a base model that can complete text fluently but has no concept of being helpful, answering questions, or being safe. Ask it a question and it might respond with another question. Or finish your sentence with something unrelated. It learned language, not conversation.

Think of it as giving someone access to every library on the planet and asking them to finish your sentences. They’d be good with words. They wouldn’t necessarily be helpful.

Stage 2: Fine-tuning, or learning to be an assistant.

The base model is trained on thousands of curated prompt-response pairs, written by humans. “If someone asks this, respond like this.” This is where the model learns to follow instructions, adopt a helpful tone, and produce structured answers. The “assistant” personality emerges here. The dataset is far smaller than pre-training’s (tens of thousands of examples rather than trillions of tokens), but each example is carefully crafted.

Stage 3: RLHF, or learning what humans prefer.

The model generates multiple responses to the same prompt. Human raters rank them: this one is better than that one. A separate model (the “reward model”) is trained on those rankings. The LLM is then optimised to produce responses the reward model scores highly.
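The ranking step above has a simple mathematical core. One common formulation (a Bradley-Terry-style pairwise objective, sketched here with plain numbers standing in for a neural network's scores) trains the reward model so that the human-preferred response scores higher than the rejected one.

```python
import math

# Pairwise preference loss for reward-model training: given that human
# raters preferred one response over another, penalise the model when the
# rejected response scores higher than the chosen one.
def preference_loss(score_chosen, score_rejected):
    # -log(sigmoid(chosen - rejected)): near zero when the ranking is
    # already respected, large when it is violated.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(3.0, 0.0))  # small loss: ranking already respected
print(preference_loss(0.0, 3.0))  # large loss: ranking violated
```

The LLM is then tuned to produce responses that this reward model scores highly, which is how thousands of individual human judgements get compressed into the model's default behaviour.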

This is where personality and values diverge between providers. Anthropic, OpenAI, and Google make different choices about what “good” looks like. Should the model refuse certain requests? How cautious should it be? How formal? Those aren’t technical decisions. They’re editorial ones, baked into the alignment process. Which is why Claude, ChatGPT, and Gemini can all be built on similar transformer architectures and still feel meaningfully different to use.

Misconception: “If I use a model long enough, it learns my preferences.” Reality: The model isn’t learning from your conversation (unless the provider explicitly retrains on user data, which most business tiers don’t do). By the time you use it, the training is done. You’re interacting with a frozen snapshot. Every conversation starts fresh. It doesn’t remember you from last time.

One more practical consequence of the training pipeline: knowledge cutoffs. Everything the model “knows” comes from the text it processed during pre-training. That data has a date. Events, papers, products, and facts that appeared after the cutoff are outside the model’s training. When you ask about something recent, the model fills in using patterns from similar contexts. Sometimes it gets close. Sometimes it fabricates. In both cases, it sounds equally sure.

Key Term: RLHF (Reinforcement Learning from Human Feedback) — A training technique where human evaluators rate model outputs, and those ratings train the model to produce better responses. This is the stage that gives models their personality: helpfulness, caution, conversational style. See the Glossary for details.

Why hallucination is architectural, not accidental

A colleague sends you an AI-generated report. It cites three academic papers. The citations look immaculate: authors, journal names, publication years, DOI numbers. You check them. Two don’t exist. The third exists but says something different from what the report claims.

This wasn’t a glitch. The model did exactly what it was designed to do.

When generating a citation, the model’s job is to predict the most likely next tokens. “Likely” means “follows the pattern of a real citation.” A plausible author name. A plausible journal. A plausible year. A plausible DOI format. The model generated the pattern of a citation, not a real one. It has no mechanism to check whether that citation corresponds to an actual paper.

Hallucination is structural for three reasons.

The generation mechanism itself. Next-token prediction is all the model does. There is no fact-checking step, no verification pass, no database query. The model predicts what text should come next based on statistical patterns. Truth and statistical likelihood are correlated. Most of the time, the likely next word is also the correct one. But they’re not the same thing.

Training data gaps. When asked about something that wasn’t well-represented in training data (a niche professional standard, a recent development, an obscure statistic) the model doesn’t say “I don’t know.” It fills the gap with tokens that are statistically plausible given the surrounding context. Plausible and accurate are different things.

One-directional generation. The model generates tokens left to right. Each token is committed before the next one is predicted. If token 50 creates an inconsistency with token 200, the model has no mechanism to go back and fix token 50. It’s locked in.

Misconception: “Hallucination will be fixed in the next model update, it’s just a bug.” Reality: Better training, bigger models, and improved alignment techniques can reduce hallucination rates. But the fundamental mechanism (statistical prediction with no built-in verification) is the architecture itself. Understanding this is more useful than waiting for a fix that isn’t coming in the way most people imagine.

We explored the practical side of this in our piece on The Real Reason AI Invents Facts (And How to Make It Stop). Mitigation strategies exist (grounding responses in source documents, asking for citations, lowering temperature, using RAG). Elimination doesn’t. The human reviewing AI output has a permanent role.
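A flavour of what grounding looks like in practice: flag any output sentence that shares too little vocabulary with the source document. This crude word-overlap check (the threshold and example text are invented for illustration) catches fabrications that don't echo the source at all, and misses ones that do; real grounding pipelines use retrieval and semantic matching, but the principle is the same.

```python
# Naive grounding check: flag output sentences with low word overlap
# against the source text. A sketch, not a real fact-checker.
def flag_ungrounded(output_sentences, source_text, threshold=0.5):
    source_words = set(source_text.lower().split())
    flagged = []
    for sentence in output_sentences:
        words = sentence.lower().split()
        if not words:
            continue
        overlap = sum(w in source_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence)
    return flagged

source = "revenue grew 12 percent in the third quarter driven by cloud sales"
out = ["Revenue grew 12 percent in the third quarter",
       "The CEO announced a merger with Globex"]
print(flag_ungrounded(out, source))
# ['The CEO announced a merger with Globex']
```

Even a check this crude shifts the workflow in the right direction: the model generates, and something else verifies.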

Apply This Monday

Take an AI-generated output from your recent work. Find one specific factual claim: a statistic, a date, a name, a citation. Check it against the original source. Was it accurate? Find a second claim and check that too. Track your hit rate. Over a few weeks, you’ll build calibrated intuition for which types of claims AI gets right and which it fabricates. That calibration is probably the most useful skill for working with AI.

Key Takeaways

01

LLMs predict, they don't know. Every response is a statistical prediction of the most likely next token, not a retrieval from a knowledge database.

02

Context windows are finite working memory. Everything (your prompt, documents, and the model's response) has to fit. Information in the middle gets less reliable attention than information at the edges.

03

The training pipeline shapes personality. Pre-training gives knowledge, fine-tuning gives helpfulness, RLHF gives values. Different alignment choices are why Claude and ChatGPT feel different.

04

Hallucination is architectural, not accidental. A system designed to predict likely text has no built-in mechanism to check whether that text is true. This isn't getting "fixed." It's the architecture.

05

Understanding the mechanism changes how you use it. Knowing the model predicts rather than retrieves changes what you ask, how you structure prompts, and how much you trust the output.

