Reinforcement Learning
Reinforcement learning (RL) is a machine learning paradigm where an agent learns optimal behavior by interacting with an environment, receiving rewards or penalties for its actions, and iteratively improving its strategy to maximize cumulative reward.
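The interaction loop described above can be sketched with tabular Q-learning on a toy environment. This is an illustrative example, not any specific system from the text: the corridor environment, its reward of 1.0 at the goal, and the hyperparameters are all invented for the sketch.

```python
import random

# Toy corridor environment: states 0..4, agent starts at 0, reward at state 4.
# Environment, reward, and hyperparameters are illustrative assumptions.
N_STATES = 5
ACTIONS = [-1, +1]  # move left or right

def step(state, action):
    """Apply an action; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    if next_state == N_STATES - 1:
        return next_state, 1.0, True   # reached the goal
    return next_state, 0.0, False

# Tabular Q-values: Q[state][action_index]
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration rate

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        # Epsilon-greedy: usually exploit the best-known action, sometimes explore.
        if random.random() < epsilon:
            a = random.randrange(2)
        else:
            a = max(range(2), key=lambda i: Q[state][i])
        next_state, reward, done = step(state, ACTIONS[a])
        # Q-learning update: move the estimate toward
        # (immediate reward + discounted best future value).
        target = reward + gamma * max(Q[next_state])
        Q[state][a] += alpha * (target - Q[state][a])
        state = next_state

# After training, the greedy policy should move right from every state.
policy = [max(range(2), key=lambda i: Q[s][i]) for s in range(N_STATES - 1)]
print(policy)  # → [1, 1, 1, 1] (always move right toward the goal)
```

The agent is never told that "move right" is correct; the policy emerges purely from reward feedback, which is the essence of the paradigm the definition describes.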
RL has produced some of AI's most dramatic achievements. DeepMind's AlphaGo defeated the world Go champion in 2016. AlphaZero mastered chess, shogi, and Go through self-play alone. AlphaStar reached Grandmaster level in StarCraft II. These demonstrations showed that RL can master complex strategic domains where explicit programming is impractical—the agent discovers strategies that human experts hadn't considered.
Reinforcement learning from human feedback (RLHF) has become essential to making language models useful and safe. After pre-training on text data, models are fine-tuned using human preference signals—evaluators rank model outputs, and RL optimizes the model to produce responses humans prefer. This process transforms a raw text predictor into a helpful, harmless assistant. Variants like RLAIF (RL from AI feedback) and DPO (Direct Preference Optimization) are reducing the cost and complexity of alignment.
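The DPO variant mentioned above replaces the RL optimization loop with a direct loss over preference pairs. The sketch below shows the core DPO objective for a single pair; the log-probability values are made-up toy numbers, and a real implementation would compute them from the policy and a frozen reference model over full responses.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a complete response
    under the policy or the frozen reference model (toy numbers here).
    """
    # Implicit reward: how much the policy favors a response vs. the reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # Loss is low when the chosen response is favored more than the rejected one.
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))  # -log(sigmoid(logits))

# Policy already prefers the human-chosen response relative to the reference:
aligned = dpo_loss(-10.0, -14.0, -12.0, -12.0)
# Policy prefers the rejected response instead:
misaligned = dpo_loss(-14.0, -10.0, -12.0, -12.0)
print(aligned < misaligned)  # → True: aligned preferences yield lower loss
```

Minimizing this loss pushes the model toward preferred responses without training a separate reward model or running an RL algorithm, which is why DPO reduces the cost and complexity of alignment.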
For AI agents, RL provides the learning framework for autonomous behavior. When agents interact with environments—browsing the web, writing code, managing tasks—they need strategies for when to explore new approaches versus exploit known solutions. RL-trained agents can learn complex multi-step workflows, adapt to novel situations, and improve with experience. The growth of the autonomous task horizon to 14.5 hours reflects RL-enabled agents that can sustain productive work over extended periods through learned strategies for planning, error recovery, and resource management.
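The explore-versus-exploit tradeoff mentioned above is classically illustrated with a multi-armed bandit. The sketch below is a minimal epsilon-greedy bandit; the three arms and their hidden payoff probabilities are invented for the example and stand in for an agent's choice among competing approaches.

```python
import random

# Hidden payoff probabilities, unknown to the agent (illustrative values).
TRUE_PAYOFFS = [0.2, 0.5, 0.8]

def pull(arm, rng):
    """Return reward 1.0 with the arm's hidden payoff probability, else 0.0."""
    return 1.0 if rng.random() < TRUE_PAYOFFS[arm] else 0.0

def run_bandit(epsilon, steps=5000, seed=42):
    rng = random.Random(seed)
    counts = [0] * len(TRUE_PAYOFFS)
    values = [0.0] * len(TRUE_PAYOFFS)  # running mean reward per arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(TRUE_PAYOFFS))                       # explore
        else:
            arm = max(range(len(TRUE_PAYOFFS)), key=values.__getitem__)  # exploit
        reward = pull(arm, rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total / steps, values

avg_reward, estimates = run_bandit(epsilon=0.1)
# With modest exploration the agent identifies the best arm (index 2) and its
# average reward approaches, but never quite reaches, the optimal 0.8.
```

Pure exploitation (epsilon = 0) risks locking onto the first arm that happens to pay out; pure exploration (epsilon = 1) never capitalizes on what was learned. Agents face the same dilemma when deciding whether to retry a known-good workflow or attempt a new one.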