Tensor Processing Units
Tensor Processing Units (TPUs) are custom AI accelerator chips designed by Google specifically for machine learning workloads. First deployed internally in 2015, TPUs represent the most prominent example of a hyperscaler designing its own silicon to optimize AI computation rather than relying solely on general-purpose GPUs.
The architectural philosophy differs from GPUs. Where NVIDIA GPUs are flexible parallel processors that excel at many workloads, TPUs are more specialized: they're optimized for the dense matrix multiplications that dominate neural network training and inference. The core computation unit is a large systolic array — a grid of multiply-accumulate units that processes matrix operations in a highly regular, predictable pattern. This specialization trades flexibility for efficiency: TPUs perform fewer types of operations but execute them faster and more power-efficiently.
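The data-flow pattern described above can be illustrated with a toy, time-stepped simulation. This is not Google's design, just a minimal sketch of an output-stationary systolic schedule: each grid cell owns one accumulator of the result matrix, operands from A and B are "skewed" so they arrive at the right cell at the right step, and every cell performs exactly one multiply-accumulate per step.

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy simulation of an output-stationary systolic array computing A @ B.

    Cell (i, j) holds the accumulator for C[i, j]. Row i of A flows
    rightward, delayed by i steps; column j of B flows downward, delayed
    by j steps. Element A[i, s] and B[s, j] therefore meet in cell (i, j)
    at step t = s + i + j.
    """
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    # The last operand pair meets at step (m-1) + (n-1) + (k-1).
    for t in range(m + n + k - 2):
        for i in range(m):
            for j in range(n):
                s = t - i - j  # which dot-product term reaches this cell now
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]  # one MAC per cell per step
    return C
```

The point of the schedule is regularity: every cell does the same small operation on every step, with no caches or branch prediction involved, which is what makes the hardware dense and power-efficient.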
TPU evolution reflects the rapid progression of AI demands. TPU v1 (2015) was inference-only, powering Google Search's RankBrain. TPU v2 and v3 added training support and were organized into pods of interconnected chips. TPU v4 (2022) scaled to 4,096 chips per pod connected by Google's custom ICI (Inter-Chip Interconnect). TPU v5p scales to 8,960 chips per pod with improved performance per watt. Trillium (TPU v6e), announced in 2024, offers further improvements in compute density and energy efficiency.
Google uses TPUs to train and serve its own AI models, including Gemini and PaLM. They're also available through Google Cloud, where researchers and companies can rent TPU capacity for their own workloads. Frameworks like JAX and TensorFlow are optimized for TPU execution, while PyTorch support (via XLA compilation) has improved significantly.
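From the user's side, this framework-level support is largely transparent: JAX code is traced and compiled by XLA into kernels for whatever backend is present, TPU or otherwise. A minimal sketch (the layer shapes here are arbitrary, for illustration only):

```python
import jax
import jax.numpy as jnp

# jax.devices() reports whatever accelerators XLA found; on a Cloud TPU VM
# this lists TpuDevice entries, elsewhere it falls back to GPU or CPU.
print(jax.devices())

@jax.jit  # traced once, then compiled by XLA into fused backend kernels
def layer(x, w, b):
    return jax.nn.relu(x @ w + b)

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (128, 512))
w = jax.random.normal(key, (512, 256))
b = jnp.zeros(256)
y = layer(x, w, b)
print(y.shape)  # (128, 256)
```

The same program runs unchanged on CPU, GPU, or TPU; only the compiled kernels differ, which is what "optimized for TPU execution" means in practice.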
The competitive dynamics between TPUs and NVIDIA GPUs shape the AI hardware market. Google's investment in custom silicon gives it cost advantages for its own AI services and reduces dependency on NVIDIA's supply-constrained products. Other hyperscalers have followed: Amazon's Trainium, Microsoft's Maia, and Meta's MTIA chips all reflect the same logic — custom silicon can deliver better performance per dollar and per watt for known workloads.
TPUs also demonstrate the value of co-designing hardware and software. Google's ability to optimize its AI frameworks, compilers, and training infrastructure specifically for TPU architecture creates efficiencies that general-purpose hardware can't match. This vertical integration model — where the chip maker also controls the software stack and the workload — parallels Apple's approach to mobile computing and suggests that the future of AI hardware is increasingly about system design rather than chip design alone.