Next-Token Prediction
The core mechanism of large language models: given a sequence of text, predict a probability distribution over the next piece (token), sample one, then repeat to generate coherent text one token at a time.
Why it matters
Understanding next-token prediction explains both the strengths and limitations of LLMs. It is why they can write fluently but sometimes hallucinate — they are optimizing for plausibility, not truth.
How LLMs generate text
When you prompt an LLM, it does not "think" about the answer and then write it. Instead, it looks at all the tokens so far (your prompt plus any text it has already generated) and predicts a probability distribution over what should come next. It samples from that distribution, appends the result, and repeats. Every word you see was generated one token at a time.
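The loop above can be sketched in a few lines. This is a toy illustration, not a real model: the hand-written `TOY_MODEL` table stands in for the neural network that would normally compute the next-token distribution, and the token names are invented for the example.

```python
import random

# Toy stand-in for a trained LLM: maps a context (tuple of tokens so far) to a
# probability distribution over the next token. A real model computes this
# distribution with a neural network; here it is hand-written for illustration.
TOY_MODEL = {
    ("the",): {"cat": 0.5, "dog": 0.5},
    ("the", "cat"): {"sat": 0.7, "ran": 0.3},
    ("the", "cat", "sat"): {"<eos>": 1.0},
    ("the", "cat", "ran"): {"<eos>": 1.0},
    ("the", "dog"): {"ran": 1.0},
    ("the", "dog", "ran"): {"<eos>": 1.0},
}

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = TOY_MODEL[tuple(tokens)]                    # distribution over next token
        choices, weights = zip(*dist.items())
        nxt = random.choices(choices, weights=weights)[0]  # sample from it
        if nxt == "<eos>":                                 # stop token ends generation
            break
        tokens.append(nxt)                                 # append and repeat
    return " ".join(tokens)

print(generate(["the"]))
```

Each pass through the loop sees the prompt plus everything generated so far, exactly as described above; the only difference in a real LLM is the scale of the model producing the distribution.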
Why it works so well
Predicting the next token turns out to require a surprising amount of knowledge. To correctly predict what comes next in a medical textbook, you need to understand medicine. To predict the next line of code, you need to understand programming. The training objective is simple, but the capabilities that emerge from it are complex.
Why it sometimes fails
Next-token prediction optimizes for plausibility, not accuracy. A plausible-sounding answer is not always a correct one. This is the root cause of hallucination: the model generates text that reads well but contains fabricated facts, because the pattern-matching produced a confident-sounding but wrong continuation.
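A tiny sketch of how this goes wrong. The probabilities below are invented for illustration, not taken from any real model; they mimic a distribution a model might assign after a prompt like "The largest desert on Earth is the".

```python
# Hypothetical next-token distribution (invented numbers, not from a real model)
# for the prompt "The largest desert on Earth is the".
dist = {"Sahara": 0.85, "Antarctic": 0.10, "Gobi": 0.05}

# Picking the highest-probability token selects the most *plausible*
# continuation. "Sahara" dominates everyday text, even though by area the
# Antarctic polar desert is larger: the objective rewards plausibility,
# not truth.
most_plausible = max(dist, key=dist.get)
print(most_plausible)  # "Sahara"
```

The model is not lying; it is faithfully reproducing the statistics of its training text, and those statistics favor the familiar answer over the correct one.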