RLHF
A training technique where human preferences are used to fine-tune a language model through reinforcement learning, teaching it to produce responses that humans judge as helpful, accurate, and safe.
Why it matters
RLHF is the step that turns a raw language model into a useful assistant. It is why ChatGPT and Claude feel helpful rather than just generating plausible-sounding text, and it is a key lever for AI safety.
The three-step process
RLHF typically involves three stages that happen after the model has been pre-trained on text:
- Supervised fine-tuning (SFT) — the model is trained on high-quality example conversations written by humans, teaching it the format and style of helpful responses.
- Reward model training — human raters compare pairs of model outputs and select the better one. A separate model learns to predict these preferences, producing a quality score for any response.
- RL optimization — the language model generates responses, the reward model scores them, and the language model is updated to produce responses that score higher. This runs for many iterations.
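The reward-modeling step above can be sketched numerically. The toy example below is an illustration, not any lab's actual setup: random feature vectors stand in for response embeddings, and a linear reward model is fit with the standard pairwise (Bradley-Terry) preference loss, which pushes the score of the human-preferred response above the rejected one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each response is a feature vector, and a linear reward
# model r(x) = w . x learns to score the preferred response higher.
dim = 8
true_w = rng.normal(size=dim)              # hidden "human preference"
chosen = rng.normal(size=(256, dim))
rejected = rng.normal(size=(256, dim))
# Relabel so `chosen` really is preferred under the hidden preference.
swap = chosen @ true_w < rejected @ true_w
chosen[swap], rejected[swap] = rejected[swap].copy(), chosen[swap].copy()

w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    diff = chosen - rejected
    margin = diff @ w                      # r(chosen) - r(rejected)
    sig = 1 / (1 + np.exp(-margin))
    # Gradient of -log sigmoid(margin), averaged over the batch
    grad = -((1 - sig)[:, None] * diff).mean(axis=0)
    w -= lr * grad

accuracy = ((chosen - rejected) @ w > 0).mean()
print(f"pairwise accuracy: {accuracy:.2f}")
```

After training, the learned `w` recovers the hidden preference direction well enough to rank most pairs correctly, which is exactly what the real reward model is asked to do at scale.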
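The RL optimization stage can likewise be reduced to a toy sketch (again illustrative, with made-up numbers): a policy over four candidate responses, frozen reward-model scores, and a KL penalty toward the reference model. The KL term is the ingredient that keeps the tuned model from drifting too far from its pre-trained behavior.

```python
import numpy as np

# Toy RL step: a policy over 4 candidate responses, scored by a frozen
# reward model, updated by gradient ascent on expected reward minus a
# KL penalty to the pre-trained reference policy.
rewards = np.array([0.1, 0.4, 0.2, 1.0])   # reward model scores (assumed)
ref_logits = np.zeros(4)                   # reference policy: uniform
logits = ref_logits.copy()
beta, lr = 0.5, 0.5                        # KL weight, learning rate

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(100):
    p = softmax(logits)
    # log(p_i / q_i) for the KL(p || q) penalty, q = reference policy
    kl_term = np.log(p) - ref_logits + np.log(np.exp(ref_logits).sum())
    adjusted = rewards - beta * (kl_term + 1)
    # Gradient of the penalized objective through the softmax
    grad = p * (adjusted - p @ adjusted)
    logits += lr * grad

print(softmax(logits).round(2))
```

Note how `beta` controls the trade-off: the policy shifts probability toward the highest-scoring response, but the KL penalty stops it from collapsing onto that response entirely.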
What it changes
Pre-trained models can generate fluent text but have no inherent preference for being truthful, helpful, or safe. RLHF encodes human values into the model's behavior. It also explains why assistants from different companies can feel quite distinct: their RLHF training reflects different value judgments.
Limitations
RLHF depends on the quality and consistency of human raters. It can also lead to reward hacking, where the model learns to produce responses that score well on the reward model without actually being better, for example by sounding confident rather than being correct. Alternatives such as Constitutional AI (CAI) and Direct Preference Optimization (DPO) address some of these limitations.
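For contrast, DPO skips the separate reward model and RL loop entirely and trains directly on preference pairs. A minimal sketch of its per-pair loss follows; the function name and numbers are illustrative, and the inputs are summed log-probabilities of whole responses under the policy being trained and the frozen reference model.

```python
import math

def dpo_loss(policy_lp_chosen, policy_lp_rejected,
             ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Minimizing this raises the policy's log-probability of the chosen
    response relative to the reference, and lowers it for the rejected
    one, with no reward model or RL loop.
    """
    margin = beta * ((policy_lp_chosen - ref_lp_chosen)
                     - (policy_lp_rejected - ref_lp_rejected))
    return -math.log(1 / (1 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy already favors the chosen response relative to the reference:
low = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy favors the rejected response: the loss is higher.
high = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(low < high)
```

The `beta` parameter plays the same role as the KL weight in RLHF's RL stage: larger values penalize deviation from the reference model more strongly.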