Scaling Laws
What Are Scaling Laws?
Scaling laws are empirical mathematical relationships that describe how the performance of neural networks improves as key variables increase: model size (number of parameters), training data (number of tokens), and compute (total floating-point operations). First formalized by Jared Kaplan and colleagues at OpenAI in 2020, these power-law relationships revealed that cross-entropy loss decreases smoothly and predictably across more than seven orders of magnitude of compute. This discovery transformed artificial intelligence research from an empirical art into something closer to an engineering discipline, where performance gains can be forecast before a single training run begins.
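The power-law form can be made concrete with a short sketch. The function below uses the parameter-count law from Kaplan et al. (2020), L(N) = (N_c / N)^α_N; the constants are the paper's reported fits (α_N ≈ 0.076, N_c ≈ 8.8e13), but treat the whole snippet as an illustration of the functional form rather than a tool for exact predictions.

```python
# Illustrative power-law loss curve in the Kaplan et al. (2020) form
# L(N) = (N_c / N) ** alpha_N. The constants are the paper's rough
# fits for the parameter-count law; exact values vary by setup.
ALPHA_N = 0.076
N_C = 8.8e13

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss as a function of parameter count."""
    return (N_C / n_params) ** ALPHA_N

# A smooth power law is a straight line on log-log axes: every 10x
# increase in parameters shrinks the loss by the same constant factor.
for n in (1e8, 1e9, 1e10, 1e11):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")
```

The key property, and the reason scaling laws enable forecasting, is that equal multiplicative increases in scale produce equal multiplicative decreases in loss.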
From Kaplan to Chinchilla: Compute-Optimal Training
The original Kaplan scaling laws suggested that model size mattered most—leading labs to build ever-larger models while holding training data relatively constant. In 2022, DeepMind's Chinchilla paper overturned this assumption. By training over 400 models ranging from 70 million to 16 billion parameters on varying amounts of data, researchers demonstrated that for a given compute budget, model size and training tokens should be scaled in equal proportion: doubling the budget means doubling both parameters and tokens. The practical implication was striking: a smaller model trained on more data could outperform a much larger undertrained model at the same cost. This insight reshaped how frontier labs like Google DeepMind, Anthropic, and Meta allocate their training budgets, and influenced architectures such as Meta's Llama 3, which pushed token-to-parameter ratios as high as 200:1—far beyond Chinchilla's original recommendation of roughly 20:1.
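The Chinchilla allocation can be sketched in a few lines. The snippet below assumes the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) and the roughly 20-tokens-per-parameter optimum cited above; both are simplifications of the paper's fitted laws, not exact constants.

```python
import math

# Sketch of Chinchilla-style compute-optimal allocation, assuming
# C ~= 6 * N * D (a standard FLOP-count approximation) and a
# tokens-per-parameter optimum of ~20. Both are approximations.
TOKENS_PER_PARAM = 20.0

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) spending `compute_flops` at D/N = 20."""
    # C = 6 * N * D with D = 20 * N  =>  C = 120 * N**2
    n_params = math.sqrt(compute_flops / (6.0 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Chinchilla itself used ~5.9e23 FLOPs: ~70B params on ~1.4T tokens.
n, d = chinchilla_optimal(5.9e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Note that both N and D grow as the square root of compute, which is exactly the "scale them in equal proportion" prescription: a 100x compute budget buys a 10x larger model trained on 10x more tokens.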
Inference-Time Scaling: A New Frontier
By 2024–2025, a new dimension of scaling emerged: inference-time compute. Rather than only scaling resources during training, models like OpenAI's o1 and o3 reasoning series allocate additional compute at inference time, performing multiple reasoning passes to solve complex problems. Research has shown that scaling inference compute with advanced strategies can be more efficient than scaling model parameters alone, with smaller models combined with sophisticated inference algorithms offering superior cost-performance trade-offs. This shift is reshaping the GPU and semiconductor landscape: some industry projections put long-run inference demand at more than 100 times training demand, driving procurement toward inference-optimized hardware from companies like NVIDIA.
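One of the simplest inference-time scaling strategies is majority voting over repeated samples (often called self-consistency). The toy sketch below stands in for a stochastic model call with a coin flip; the 70% per-sample accuracy is an illustrative assumption, and `sample_answer` is a hypothetical placeholder, not a real model API.

```python
import random
from collections import Counter

# Toy model of majority-vote inference-time scaling. A real system
# would sample multiple chains of thought from an LLM; here each
# "sample" is just correct with an assumed 70% probability.
def sample_answer(correct: str, accuracy: float, rng: random.Random) -> str:
    return correct if rng.random() < accuracy else "wrong"

def majority_vote(correct: str, n_samples: int, accuracy: float,
                  seed: int = 0) -> str:
    rng = random.Random(seed)
    votes = Counter(sample_answer(correct, accuracy, rng)
                    for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Spending more inference compute (more samples, odd to avoid ties)
# raises the chance the majority answer is correct, with no change
# to the underlying model at all.
print(majority_vote("42", n_samples=1, accuracy=0.7))
print(majority_vote("42", n_samples=25, accuracy=0.7))
```

This is the basic trade-off the section describes: a fixed, smaller model plus more samples can match or beat a larger model answering once, at lower total cost.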
Economic and Strategic Implications
Scaling laws are the economic engine behind the modern AI industry. They provide the empirical justification for the hundreds of billions of dollars flowing into AI infrastructure—from data centers and chip fabrication to energy generation. Every major decision at frontier AI labs about model size, training budget, and data collection is informed by these equations. The predictability of scaling laws also creates a strategic dynamic: organizations that can project future capabilities based on planned compute investments gain a decisive planning advantage, making scaling laws central to the geopolitics of AI and the agentic economy. However, researchers increasingly recognize that raw parameter scaling alone faces diminishing returns, prompting exploration of Mixture of Experts architectures, synthetic data generation, and reinforcement learning from human feedback as complementary strategies for pushing the capability frontier.
Scaling Laws Beyond Language Models
While most prominently studied in large language models, scaling laws have been observed across multiple domains including vision models, speech recognition, generative AI for images and video, game-playing agents, and robotics. This universality suggests that scaling laws reflect something fundamental about how neural networks learn representations from data—a principle with profound implications for the future of artificial general intelligence and the broader trajectory of deep learning research.
Further Reading
- Scaling Laws for Neural Language Models (Kaplan et al., 2020) — The foundational OpenAI paper establishing power-law relationships between compute, data, parameters, and loss
- Training Compute-Optimal Large Language Models (Hoffmann et al., 2022) — The Chinchilla paper from DeepMind that redefined optimal training data-to-parameter ratios
- How Scaling Laws Drive Smarter, More Powerful AI — NVIDIA Blog — Accessible overview of how scaling laws shape AI hardware and infrastructure strategy
- Scaling Laws Literature Review — Epoch AI — Comprehensive academic review of the scaling laws research landscape
- Inference Scaling Laws: Compute-Optimal Inference for Problem-Solving (2024) — Research on scaling compute at inference time as an alternative to parameter scaling
- Scaling Laws for LLMs: From GPT-3 to o3 — Detailed walkthrough of how scaling laws evolved across model generations