Models & Platforms

Pre-training

The initial training phase where a language model learns general language patterns from a massive text corpus, before being fine-tuned for specific tasks or behaviors.

Why it matters

Pre-training is where a model acquires its general knowledge and language understanding. It is typically the most expensive and time-consuming stage of building a language model, often costing millions of dollars in compute.

How it works

During pre-training, a language model is given enormous amounts of text and trained to predict the next word (or token) in a sequence. By doing this billions of times across trillions of tokens, the model learns grammar, facts, reasoning patterns, and even some common sense — all as a side effect of getting better at prediction.
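The core idea, "predict the next token from what came before," can be shown with a toy model that learns from raw text by counting which token follows which. This is a drastically simplified stand-in (a bigram counter, not a neural network), but the objective is the same one used at scale:

```python
from collections import Counter, defaultdict

# Tiny corpus standing in for the trillions of tokens used in real pre-training.
corpus = "the cat sat on the mat and the cat slept".split()

# "Training": count how often each token follows each other token.
follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

def predict_next(token):
    """Return the most likely next token seen during training."""
    counts = follows[token]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" followed "the" twice, "mat" once → "cat"
```

A real model replaces the count table with billions of learned parameters and predicts a probability distribution over the whole vocabulary, but "get better at guessing the next token" is the entire training signal.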

The training pipeline

Pre-training is stage one of a multi-stage process:

  • Pre-training — learn general language from a large corpus. This produces a base model that can complete text but is not yet a useful assistant.
  • Fine-tuning — train on curated examples of helpful conversations and task completion.
  • RLHF / alignment — use human feedback to refine behavior, safety, and helpfulness.
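The three stages above can be sketched as a pipeline. Every name and structure here is a hypothetical stand-in to show the ordering and what each stage adds, not a real training API:

```python
# Illustrative sketch: the "model" is just a dict of capabilities, and each
# function stands in for a full training stage.
def pretrain(corpus):
    # Stage 1: next-token prediction over a large corpus → base model.
    return {"general_language": bool(corpus), "assistant": False, "aligned": False}

def fine_tune(model, demonstrations):
    # Stage 2: supervised fine-tuning on curated helpful examples.
    return {**model, "assistant": bool(demonstrations)}

def rlhf(model, human_feedback):
    # Stage 3: refine behavior, safety, and helpfulness from preference data.
    return {**model, "aligned": bool(human_feedback)}

assistant = rlhf(fine_tune(pretrain(["web text"]), ["demos"]), ["preferences"])
```

The ordering matters: fine-tuning and alignment shape behavior that only exists because pre-training already gave the base model its general language ability.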

Cost and scale

Pre-training a frontier model requires thousands of GPUs running for weeks or months. GPT-4 and Claude-scale models are estimated to cost tens of millions of dollars in compute alone. This cost is why most teams use pre-trained models through APIs rather than training from scratch.
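A rough way to see where these costs come from is the widely cited approximation that pre-training takes about 6 × N × D floating-point operations for N parameters and D training tokens. The specific model size, token count, and GPU throughput below are illustrative assumptions, not figures from this article:

```python
# Back-of-envelope pre-training compute using the ~6 * N * D FLOPs rule.
params = 70e9    # hypothetical 70B-parameter model
tokens = 2e12    # hypothetical 2T-token corpus
flops = 6 * params * tokens  # ≈ 8.4e23 FLOPs

# Assumed effective (not peak) throughput per GPU, in FLOPs per second.
gpu_flops_per_sec = 4e14
gpus = 1000
days = flops / gpu_flops_per_sec / gpus / 86_400
print(f"{flops:.1e} FLOPs ≈ {days:.0f} days on {gpus} GPUs")
```

Even with these modest assumptions the run ties up a thousand GPUs for weeks, which is why most teams consume pre-trained models via APIs instead of repeating this step.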