Scaling Laws

What Are Scaling Laws?

Scaling laws are empirical mathematical relationships that describe how the performance of neural networks improves as key variables increase: model size (number of parameters), training data (number of tokens), and compute (total floating-point operations). First formalized by Jared Kaplan and colleagues at OpenAI in 2020, these power-law relationships revealed that cross-entropy loss decreases smoothly and predictably across more than seven orders of magnitude of scale. This discovery transformed artificial intelligence research from an empirical art into something closer to an engineering discipline, where performance gains can be forecast before a single training run begins.
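In the Kaplan formulation, test loss follows an approximate power law in each variable when the other two are not bottlenecks. The forms below are a schematic summary of that relationship; the exponents are quoted only roughly from the 2020 paper and depend on dataset and architecture details.

```latex
% Approximate power-law forms reported by Kaplan et al. (2020):
%   N = non-embedding parameters, D = training tokens, C = training compute.
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}
% Fitted exponents are roughly \alpha_N \approx 0.076,
% \alpha_D \approx 0.095, and \alpha_C \approx 0.05.
```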

From Kaplan to Chinchilla: Compute-Optimal Training

The original Kaplan scaling laws suggested that model size mattered most, leading labs to build ever-larger models while holding training data relatively constant. In 2022, DeepMind's Chinchilla paper overturned this assumption. By training over 400 models ranging from 70 million to 16 billion parameters on varying amounts of data, researchers demonstrated that for a given compute budget, model size and training tokens should be scaled in equal proportion. The practical implication was striking: a smaller model trained on more data could outperform a much larger, undertrained model at the same cost. This insight reshaped how frontier labs like Google DeepMind, Anthropic, and Meta allocate their training budgets, and it influenced training recipes such as Meta's Llama 3, which pushed token-to-parameter ratios to roughly 200:1 and beyond, far past Chinchilla's compute-optimal guideline of about 20:1, in order to produce smaller models that are cheaper to serve at inference time.
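To make the compute-optimal trade-off concrete, the sketch below uses the widely cited approximation that training compute is about 6·N·D floating-point operations (N parameters, D tokens) together with the rough Chinchilla heuristic of 20 tokens per parameter. Both constants are approximations, and the function name is illustrative rather than taken from any published codebase.

```python
def chinchilla_allocation(compute_budget_flops: float, tokens_per_param: float = 20.0):
    """Split a training FLOP budget between model size and data.

    Assumes the common approximation C ~= 6 * N * D and the rough
    Chinchilla rule D ~= 20 * N. Substituting D = tokens_per_param * N
    gives C = 6 * tokens_per_param * N**2, so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_params = (compute_budget_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example: a 1e24 FLOP training budget.
params, tokens = chinchilla_allocation(1e24)
print(f"~{params / 1e9:.0f}B parameters trained on ~{tokens / 1e12:.1f}T tokens")
```

Because both N and D scale as the square root of compute under this rule, doubling the budget grows the model and the dataset by the same factor, which is the sense in which the two should be scaled equally.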

Inference-Time Scaling: A New Frontier

By 2024–2025, a new dimension of scaling emerged: inference-time compute. Rather than scaling resources only during training, reasoning models such as OpenAI's o1 and o3 allocate additional compute at inference time, generating longer or multiple reasoning passes to solve complex problems. Research has shown that scaling inference compute with well-chosen strategies can be more efficient than scaling model parameters alone, with smaller models paired with sophisticated inference algorithms offering superior cost-performance trade-offs. The shift is also reshaping the GPU and semiconductor landscape: some industry projections put aggregate inference demand at 100 times training demand or more, driving procurement toward inference-optimized hardware from companies like NVIDIA.
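One simple way to picture inference-time scaling is self-consistency voting: sample several independent answers to the same problem and return the majority result, so that accuracy is bought with more samples rather than more parameters. The sketch below is a generic illustration of that idea, not a description of how o1 or o3 work internally; generate_fn is a hypothetical stand-in for any stochastic model-sampling call.

```python
import collections
import random
from typing import Callable


def self_consistency_answer(generate_fn: Callable[[str], str],
                            prompt: str, n_samples: int = 16) -> str:
    """Majority-vote over multiple sampled answers to the same prompt.

    generate_fn is a placeholder for an LLM sampling call that maps a
    prompt to a final answer string. Increasing n_samples spends more
    inference compute without changing the underlying model.
    """
    answers = [generate_fn(prompt) for _ in range(n_samples)]
    most_common_answer, _count = collections.Counter(answers).most_common(1)[0]
    return most_common_answer


# Toy usage with a dummy sampler (a real call would hit a model API instead).
dummy_sampler = lambda prompt: random.choice(["42", "42", "41"])
print(self_consistency_answer(dummy_sampler, "What is 6 * 7?", n_samples=11))
```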

Economic and Strategic Implications

Scaling laws are the economic engine behind the modern AI industry. They provide the empirical justification for the hundreds of billions of dollars flowing into AI infrastructure—from data centers and chip fabrication to energy generation. Every major decision at frontier AI labs about model size, training budget, and data collection is informed by these equations. The predictability of scaling laws also creates a strategic dynamic: organizations that can project future capabilities based on planned compute investments gain a decisive planning advantage, making scaling laws central to the geopolitics of AI and the agentic economy. However, researchers increasingly recognize that raw parameter scaling alone faces diminishing returns, prompting exploration of Mixture of Experts architectures, synthetic data generation, and reinforcement learning from human feedback as complementary strategies for pushing the capability frontier.

Scaling Laws Beyond Language Models

While most prominently studied in large language models, scaling laws have been observed across multiple domains including vision models, speech recognition, generative AI for images and video, game-playing agents, and robotics. This universality suggests that scaling laws reflect something fundamental about how neural networks learn representations from data—a principle with profound implications for the future of artificial general intelligence and the broader trajectory of deep learning research.

Further Reading