Knowledge Distillation

Knowledge distillation is a technique for transferring the capabilities of a large, powerful "teacher" model into a smaller, more efficient "student" model. The student learns not just from the original training data but from the teacher's output distributions—its soft predictions, reasoning patterns, and nuanced probability assignments—effectively compressing intelligence into a more deployable form.

The core idea, introduced by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, exploits a crucial insight: a large model's wrong answers are informative. When a model trained on animal images says a horse picture is 80% horse, 15% zebra, and 5% donkey, those "soft" probabilities encode rich relational knowledge (horses are more like zebras than donkeys). In practice, a temperature parameter softens both the teacher's and the student's output distributions further, amplifying the small probabilities that carry this relational signal. Training a student on these soft targets transfers more information than hard labels alone: the student learns the teacher's understanding of similarity, uncertainty, and edge cases.
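The soft-target objective can be sketched numerically. The T²-scaled soft-loss weighting below follows Hinton et al. (2015), but the function names, the example logits, and the `alpha` mixing weight are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Softened softmax: a higher temperature flattens the distribution,
    # exposing the small probabilities on "wrong" classes.
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    # Weighted sum of soft-target cross-entropy and hard-label loss.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Cross-entropy against the teacher's soft targets, scaled by T^2
    # (per Hinton et al.) so its gradient magnitude matches the hard term.
    soft_loss = -np.sum(p_teacher * np.log(p_student)) * temperature**2
    hard_loss = -np.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical teacher logits for a horse image, classes [horse, zebra, donkey]:
teacher = np.array([4.0, 2.3, 1.2])
student = np.array([2.0, 1.0, 0.5])
loss = distillation_loss(student, teacher, hard_label=0)
```

At temperature 1 the teacher's distribution is sharply peaked on "horse"; at temperature 4 the zebra and donkey probabilities grow large enough to meaningfully shape the student's gradients.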

Distillation has become central to AI deployment strategy. Frontier models with hundreds of billions of parameters are too expensive to run at scale for every query. Distilled versions—like the smaller variants of Llama, Gemma, and Phi—retain much of the capability at a fraction of the cost. Quantization reduces numerical precision; distillation reduces architectural size. Together they enable inference on phones, laptops, and edge devices that could never run the original teacher model.
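The compounding effect of the two techniques is easiest to see as back-of-envelope arithmetic. The parameter counts below are hypothetical round numbers, not figures for any specific model:

```python
def weight_memory_gb(params_billions, bytes_per_weight):
    # Rough storage footprint of the weights alone,
    # ignoring activations, optimizer state, and KV cache.
    return params_billions * 1e9 * bytes_per_weight / 1e9

teacher_fp16 = weight_memory_gb(70, 2)    # 140 GB: multi-GPU territory
student_fp16 = weight_memory_gb(8, 2)     # 16 GB: one high-end GPU
student_int4 = weight_memory_gb(8, 0.5)   # 4 GB: laptop or phone class
```

Distillation alone cuts the footprint roughly 9x here; stacking 4-bit quantization on the distilled student brings the combined reduction to 35x.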

The technique also powers the open-weight ecosystem's rapid improvement. Smaller labs can distill from frontier models (where licensing permits) to create specialized, efficient models for specific domains. This accelerates the cost deflation that defines the current AI landscape—the 92% drop in inference costs over three years is partly driven by distilled models replacing full-size ones in production workloads.