Pre-training
The initial training phase where a language model learns general language patterns from a massive text corpus, before being fine-tuned for specific tasks or behaviors.
Why it matters
Pre-training is where a model acquires its general knowledge and language understanding. It is the most expensive and time-consuming step in building a language model, often costing millions of dollars in compute.
How it works
During pre-training, a language model is given enormous amounts of text and trained to predict the next word (or token) in a sequence. By doing this billions of times across trillions of tokens, the model learns grammar, facts, reasoning patterns, and even some common sense — all as a side effect of getting better at prediction.
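The objective above can be illustrated with a toy sketch: count which token follows each token in a tiny corpus, then "predict" the most frequent successor. This is a minimal stand-in, not how real pre-training works; actual models use neural networks trained by gradient descent over trillions of tokens, and the example corpus here is invented. The point is only that the training signal is "predict the next token."

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; real pre-training corpora span trillions of tokens.
corpus = "the cat sat on the mat the cat ran".split()

# Count which token follows each token in the corpus.
successors = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current][nxt] += 1

def predict_next(token):
    """Return the most frequently observed next token."""
    return successors[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

A neural language model does the same thing in spirit, but it outputs a probability distribution over the whole vocabulary and is penalized (via cross-entropy loss) whenever it assigns low probability to the token that actually came next.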
The training pipeline
Pre-training is stage one of a multi-stage process:
- Pre-training — learn general language from a large corpus. This produces a base model that can complete text but is not yet a useful assistant.
- Fine-tuning — train on curated examples of helpful conversations and task completion.
- RLHF / alignment — use human feedback to refine behavior, safety, and helpfulness.
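The three stages above can be sketched as successive transformations of one model. All the names here (`pretrain`, `finetune`, `align`) are illustrative placeholders, not a real training API; the sketch only shows the order of the stages and what each one adds.

```python
def pretrain(corpus: list[str]) -> dict:
    # Stage 1: learn general language from a large corpus -> base model.
    # A base model completes text but is not yet a useful assistant.
    return {"stage": "base"}

def finetune(model: dict, demonstrations: list[str]) -> dict:
    # Stage 2: train on curated examples of helpful conversations.
    return {**model, "stage": "fine-tuned"}

def align(model: dict, human_feedback: list[str]) -> dict:
    # Stage 3: refine behavior, safety, and helpfulness (e.g. RLHF).
    return {**model, "stage": "aligned"}

# The stages run in sequence, each building on the previous one.
model = align(finetune(pretrain(["raw text..."]), ["example dialogs"]), ["ratings"])
print(model["stage"])
```

Each later stage is far cheaper than pre-training, which is why teams can fine-tune or align an existing base model without repeating the expensive first stage.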
Cost and scale
Pre-training a frontier model requires thousands of GPUs running for weeks or months. GPT-4 and Claude-scale models are estimated to cost tens of millions of dollars in compute alone. This cost is why most teams use pre-trained models through APIs rather than training from scratch.