Next-Token Prediction
The core mechanism of large language models: given a sequence of text, predict a probability distribution over the next piece (token), sample one, then repeat to generate coherent text one token at a time.
Why it matters
Understanding next-token prediction explains both the strengths and limitations of LLMs. It is why they can write fluently but sometimes hallucinate — they are optimizing for plausibility, not truth.
How LLMs generate text
When you prompt an LLM, it does not "think" about the answer and then write it. Instead, it looks at all the tokens so far (your prompt plus any text it has already generated) and predicts a probability distribution over what should come next. It samples from that distribution, appends the result, and repeats. Every word you see was generated one token at a time.
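The loop above can be sketched in a few lines. This is a toy illustration, not a real model: the hand-written `TOY_MODEL` table stands in for the neural network that would normally compute the next-token distribution, and the token names are invented for the example.

```python
import random

# Toy stand-in for a trained LLM: maps a context (tuple of tokens so far) to a
# probability distribution over the next token. A real model computes this
# distribution with a neural network; here it is hand-written for illustration.
TOY_MODEL = {
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 1.0},
    ("the", "cat", "ran"): {"<eos>": 1.0},
    ("the", "dog"): {"ran": 1.0},
    ("the", "dog", "ran"): {"<eos>": 1.0},
}

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = TOY_MODEL[tuple(tokens)]                    # distribution over next token
        choices, weights = zip(*dist.items())
        nxt = random.choices(choices, weights=weights)[0]  # sample from it
        if nxt == "<eos>":                                 # stop token ends generation
            break
        tokens.append(nxt)                                 # append and repeat
    return " ".join(tokens)

print(generate(["the"]))
```

Each pass through the loop sees the prompt plus everything generated so far, exactly as described above; the only difference in a real LLM is the scale of the model producing the distribution.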
Why it works so well
Predicting the next token turns out to require a surprising amount of knowledge. To correctly predict what comes next in a medical textbook, you need to understand medicine. To predict the next line of code, you need to understand programming. The training objective is simple, but the capabilities that emerge from it are complex.
Why it sometimes fails
Next-token prediction optimizes for plausibility, not accuracy. A plausible-sounding answer is not always a correct one. This is the root cause of hallucination: the model generates text that reads well but contains fabricated facts, because the pattern-matching produced a confident-sounding but wrong continuation.
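A tiny sketch of how this goes wrong. The probabilities below are invented for illustration, not taken from any real model; they mimic a distribution a model might assign after a prompt like "The largest desert on Earth is the".

```python
# Hypothetical next-token distribution (invented numbers, not from a real model)
# for the prompt "The largest desert on Earth is the".
dist = {"Sahara": 0.85, "Antarctic": 0.10, "Gobi": 0.05}

# Picking the highest-probability token selects the most *plausible*
# continuation. "Sahara" dominates everyday text, even though by area the
# Antarctic polar desert is larger: the objective rewards plausibility,
# not truth.
most_plausible = max(dist, key=dist.get)
print(most_plausible)  # "Sahara"
```

The model is not lying; it is faithfully reproducing the statistics of its training text, and those statistics favor the familiar answer over the correct one.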