RLHF
Reinforcement Learning from Human Feedback (RLHF) is a post-training technique that turns raw language models—which are fundamentally next-token predictors—into systems that are helpful, harmless, and honest. It's the bridge between a model that can write anything and a model that writes what you actually want.
The process works in stages. First, a base model is pre-trained on vast text corpora to learn language patterns. Then human annotators compare pairs of model outputs and indicate which response is better. These preferences train a separate "reward model" that learns to predict which responses humans would prefer. Finally, the language model is fine-tuned with reinforcement learning (typically Proximal Policy Optimization, or PPO) to maximize the reward model's scores while staying close to its original behavior, usually enforced with a KL-divergence penalty against the pre-RL model.
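The two learning stages above can be sketched numerically. The reward model is typically trained with a Bradley-Terry pairwise loss on preference pairs, and the RL stage maximizes reward minus a KL penalty. This is a minimal illustration; the function names and the β coefficient are illustrative, not from any particular library:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss for reward-model training:
    -log sigmoid(r_chosen - r_rejected). It is minimized when the
    reward model scores the human-preferred response higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def rlhf_objective(reward: float, kl_divergence: float, beta: float = 0.1) -> float:
    """The RL stage maximizes the reward-model score minus a KL penalty
    that keeps the fine-tuned policy close to the original model."""
    return reward - beta * kl_divergence
```

When the reward model already ranks the chosen response higher, the loss is small; when it ranks them equally, the loss is log 2; when it prefers the rejected response, the loss grows, pushing the scores apart.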
RLHF was the breakthrough that made ChatGPT possible. OpenAI's InstructGPT paper (2022) showed that a relatively small amount of human feedback could dramatically improve a model's usefulness. The technique was quickly adopted across the industry—Anthropic's Claude, Google's Gemini, and Meta's Llama all use variants. It remains a cornerstone of AI safety research, as it provides a mechanism (however imperfect) for encoding human values into AI systems.
The field has since evolved beyond classic RLHF. Direct Preference Optimization (DPO) eliminates the need for a separate reward model. Reinforcement fine-tuning uses programmatically verifiable rewards—such as checking a math answer or running a test suite—rather than human judgments. Constitutional AI substitutes AI feedback, guided by a written set of principles, for human feedback. But RLHF remains the foundational concept: the insight that human preferences can serve as a training signal to align AI behavior with human intentions.
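DPO's elimination of the reward model can be made concrete: it scores each preference pair directly from the policy's and a frozen reference model's log-probabilities, so the log-prob ratio plays the role the learned reward played in RLHF. A minimal sketch of the per-pair loss (function name and the β value are illustrative):

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one preference pair.
    The implicit reward of a response is beta * (policy log-prob minus
    reference log-prob); the loss is -log sigmoid of the reward margin
    between the chosen and rejected responses."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

If the policy matches the reference model exactly, the margin is zero and the loss is log 2; gradient descent then raises the chosen response's probability relative to the rejected one, with no reward model or RL loop involved.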