Gradient Descent
Gradient descent is the optimization algorithm at the heart of virtually all modern AI training. It works by iteratively computing how wrong a model's predictions are (the loss), calculating which direction to adjust each parameter to reduce that error (the gradient), and taking a small step in that direction. Repeated billions of times across trillions of data points, this simple process produces the emergent intelligence of large language models.
The mathematical intuition is straightforward: imagine standing on a hilly landscape in fog, trying to find the lowest valley. You can't see the whole terrain, but you can feel which direction slopes downward at your feet. Gradient descent takes a step downhill, checks again, and repeats. The "landscape" is the loss function—a mathematical surface defined by all the model's parameters (billions of them for modern LLMs)—and the algorithm navigates toward parameter configurations that minimize prediction errors.
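The downhill-step loop above can be sketched in a few lines of code. This is a toy example, not a production trainer: it minimizes a one-dimensional loss L(w) = (w − 3)² whose gradient is known analytically, repeatedly applying the update w ← w − η·∇L(w), where η is the learning rate.

```python
# Minimal sketch of gradient descent on a toy one-dimensional loss.
# Loss: L(w) = (w - 3)^2, with gradient dL/dw = 2 * (w - 3).
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point on the "landscape"
lr = 0.1   # learning rate: the size of each downhill step

for step in range(100):
    w -= lr * grad(w)   # step in the direction that reduces the loss

print(round(w, 4))  # converges toward the minimum at w = 3
```

Real training is this same loop, except the single parameter w becomes billions of parameters and the gradient is computed by backpropagation rather than by hand.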
In practice, modern AI uses stochastic gradient descent (SGD) and its variants (Adam, AdaGrad, RMSProp). Rather than computing gradients over the entire dataset (computationally prohibitive for trillion-token corpora), SGD estimates gradients from small random batches. The Adam optimizer, which adapts the learning rate for each parameter based on its gradient history, has become the default for training transformer models. The choice of optimizer, learning rate schedule, batch size, and other hyperparameters can make the difference between a successful training run and millions of dollars of wasted compute.
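As a sketch of how these pieces fit together, here is a mini-batch loop with Adam-style per-parameter updates, applied to a tiny linear regression. Everything here is illustrative: the synthetic dataset, the hyperparameter values, and the hand-rolled update are chosen for readability, not taken from any particular framework.

```python
import random

random.seed(0)

# Synthetic data: y = 2x + 1 plus noise; training should recover slope and intercept.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1))
        for x in [i / 50.0 for i in range(100)]]

w, b = 0.0, 0.0                  # model parameters (slope, intercept)
m = [0.0, 0.0]                   # Adam's first-moment (momentum-like) estimates
v = [0.0, 0.0]                   # Adam's second-moment (gradient-magnitude) estimates
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8

t = 0
for epoch in range(200):
    random.shuffle(data)
    for i in range(0, len(data), 10):        # mini-batches of 10 examples
        batch = data[i:i + 10]
        # Gradients of mean squared error with respect to w and b over the batch.
        gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
        gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
        t += 1
        new_params = []
        for j, (param, g) in enumerate([(w, gw), (b, gb)]):
            m[j] = beta1 * m[j] + (1 - beta1) * g        # smoothed gradient
            v[j] = beta2 * v[j] + (1 - beta2) * g * g    # smoothed squared gradient
            mhat = m[j] / (1 - beta1 ** t)               # bias correction
            vhat = v[j] / (1 - beta2 ** t)
            # Per-parameter step: large gradients get damped, small ones amplified.
            new_params.append(param - lr * mhat / (vhat ** 0.5 + eps))
        w, b = new_params

print(round(w, 2), round(b, 2))  # estimates close to the true slope 2 and intercept 1
```

The key difference from plain gradient descent is visible in the update line: each parameter's step is scaled by its own gradient history, which is what "adapts learning rates per-parameter" means in practice.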
What's remarkable is the gap between the simplicity of gradient descent and the complexity of what it produces. Reasoning models that can solve Olympiad problems, generative systems that create photorealistic images, agents that write software—all emerge from this basic optimization loop. Understanding gradient descent is understanding the engine that powers the AI revolution.