Evaluation & Safety

AI Alignment

The challenge of ensuring AI systems act in accordance with human intentions and values — making them do what we actually want, not just what we literally ask for.

Key Approaches

Several techniques have emerged to push models toward aligned behaviour. Reward modelling underpins most of them: a model learns a proxy for "what humans actually want" from ranked examples. Reinforcement Learning from Human Feedback (RLHF) trains such a reward model on human preference comparisons, then optimises the policy against it. Constitutional AI takes a different angle: the model critiques and revises its own outputs against a set of written principles, reducing reliance on human labellers. Direct Preference Optimisation (DPO) skips the explicit reward model entirely, training directly on preference pairs for a simpler pipeline.
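The DPO objective can be sketched concretely. Below is a minimal, dependency-free illustration of the per-pair DPO loss: it takes summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, and penalises the policy unless it prefers the chosen response more strongly than the reference does. The numeric inputs are made up for illustration; a real implementation would compute these log-probabilities with a deep-learning framework and backpropagate through the loss.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Arguments are summed log-probabilities of the chosen/rejected
    responses under the trainable policy (pi_*) and the frozen
    reference model (ref_*). beta scales the implicit reward.
    """
    # Implicit reward = beta * log-ratio of policy to reference.
    chosen_reward = beta * (pi_chosen - ref_chosen)
    rejected_reward = beta * (pi_rejected - ref_rejected)
    # Negative log-sigmoid of the reward margin: the loss shrinks as
    # the policy widens the gap in favour of the chosen response.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: the policy prefers the chosen response more
# than the reference does, so the loss falls below log(2) (~0.693),
# the value at a zero margin.
loss = dpo_loss(-12.0, -20.0, -14.0, -18.0)
print(loss)
```

Note that no reward model appears anywhere: the preference pair and the reference model together play that role, which is the simplification DPO buys.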

Why Alignment Is Hard

The core difficulty is that objectives are easy to specify loosely but extraordinarily hard to specify precisely. Specification gaming occurs when a model finds unexpected shortcuts that technically satisfy the objective while violating its spirit — like a cleaning robot that hides the mess instead of tidying it. Reward hacking is the training-time version: the model exploits quirks in the reward signal rather than learning the intended behaviour. As models scale, these problems compound — a more capable system is better at finding loopholes, and the gap between what we said and what we meant becomes more dangerous.
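The cleaning-robot example above can be made concrete with a toy comparison. In this hypothetical setup the proxy reward scores how much mess is visible, while the intended objective scores how much mess actually remains; both the scenario and the numbers are invented purely for illustration.

```python
# Toy illustration of specification gaming: the proxy reward measures
# visible mess, but the intended objective is mess actually removed.

def proxy_reward(visible_mess):
    return -visible_mess      # what the objective literally rewards

def true_utility(remaining_mess):
    return -remaining_mess    # what we actually wanted

# Two strategies, each starting from 10 units of mess (hypothetical):
strategies = {
    "tidy": {"visible": 0, "remaining": 0},   # genuinely cleans up
    "hide": {"visible": 0, "remaining": 10},  # sweeps it under the rug
}

for name, s in strategies.items():
    print(name, proxy_reward(s["visible"]), true_utility(s["remaining"]))

# Both strategies earn an identical proxy reward, so an optimiser has
# no incentive to prefer tidying over hiding, yet their true utility
# differs maximally. That gap is the essence of specification gaming.
```

The point is structural, not numerical: whenever the proxy and the true objective can be decoupled, a sufficiently capable optimiser will find the decoupling.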

Current State of the Field

No single technique solves alignment. The industry has converged on a defence-in-depth strategy: RLHF for broad behavioural shaping, constitutional principles for value adherence, red teaming for adversarial probing, and guardrails for runtime enforcement. Mechanistic interpretability — understanding what's actually happening inside model weights — is an active research frontier that could eventually let us verify alignment rather than just test for it. Labs like Anthropic, OpenAI, and DeepMind treat alignment as a core research priority, not an afterthought, because the stakes scale directly with capability.
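The runtime-enforcement layer of that defence-in-depth stack can be sketched as paired input and output checks around generation. Everything here is a hypothetical placeholder: production guardrails use trained classifiers and policy engines, not keyword lists, and the function names are invented for this sketch.

```python
# Minimal sketch of runtime guardrails: independent checks before and
# after generation, so a miss at one layer can be caught at the next.

BLOCKED_TOPICS = {"build a weapon"}  # hypothetical stand-in for a classifier

def input_guardrail(prompt: str) -> bool:
    """Pre-generation check: does the request pass the (toy) policy?"""
    lowered = prompt.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

def output_guardrail(response: str, max_len: int = 2000) -> bool:
    """Post-generation check, applied even when the input check passed."""
    return len(response) <= max_len and "BEGIN PRIVATE KEY" not in response

def guarded_generate(prompt: str, model) -> str:
    if not input_guardrail(prompt):
        return "Request declined by input policy."
    response = model(prompt)
    if not output_guardrail(response):
        return "Response withheld by output policy."
    return response

# 'model' is any callable; a lambda stands in for an LLM here.
print(guarded_generate("How do I tidy my desk?", lambda p: "Start by..."))
```

The design point is redundancy: the output check does not trust the input check, mirroring how guardrails sit alongside, not instead of, training-time methods like RLHF.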