Liquid Cooling
Liquid cooling in the context of AI infrastructure refers to thermal management systems that use liquids (typically water, specialized coolants, or dielectric fluids) to remove heat from high-power computing equipment. As AI datacenters push rack power densities beyond what air cooling can handle, liquid cooling has transitioned from a niche technology to an essential component of AI infrastructure.
The thermal challenge is straightforward arithmetic. A single server with eight NVIDIA H100 GPUs draws roughly 10-12 kW, and an NVIDIA GB200 NVL72 rack-scale system can draw 120+ kW. Traditional air-cooled datacenters are designed for 5-15 kW per rack. At 100+ kW there simply isn't enough airflow to remove the heat: the air temperature at the back of the rack would exceed safe operating limits no matter how much air conditioning is provided.
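The airflow arithmetic can be checked with the standard sensible-heat relation Q = m·cp·ΔT. The sketch below uses textbook air properties and assumes a 20 °C inlet-to-outlet temperature rise (a typical design figure, not a value from the text):

```python
# Back-of-envelope airflow required to remove a rack's heat with air.
# Air properties and the 20 K temperature rise are illustrative assumptions.

AIR_DENSITY = 1.2  # kg/m^3, near sea level at ~20 C
AIR_CP = 1005.0    # J/(kg*K), specific heat of air

def airflow_for_load(power_w: float, delta_t_k: float = 20.0) -> float:
    """Volumetric airflow (m^3/s) needed to remove power_w at a delta_t_k rise."""
    mass_flow = power_w / (AIR_CP * delta_t_k)  # kg/s, from Q = m * cp * dT
    return mass_flow / AIR_DENSITY              # convert to m^3/s

for rack_kw in (10, 15, 120):
    m3s = airflow_for_load(rack_kw * 1000)
    cfm = m3s * 2118.88  # 1 m^3/s is about 2118.88 CFM
    print(f"{rack_kw:>4} kW rack -> {m3s:.2f} m^3/s (~{cfm:,.0f} CFM)")
```

A 120 kW rack needs roughly ten thousand CFM through a single rack footprint, which is why raising chiller capacity alone cannot solve the problem.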
Several liquid cooling approaches are deployed at scale. Direct-to-chip (cold plate) cooling brings liquid directly to the processor via a metal cold plate mounted on the chip. Water or coolant flows through channels in the cold plate, absorbing heat with far greater efficiency than air. This is the most common approach for GPU clusters, used in NVIDIA's DGX and HGX systems. The liquid carries heat to building-level cooling infrastructure (cooling towers, chillers, or heat exchangers).
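The same sensible-heat relation shows why cold plates work: water's volumetric heat capacity is roughly 3,500 times that of air, so modest flow rates carry large loads. The coolant temperature rise of 10 °C below is an assumed design point, not a vendor specification:

```python
# Water flow needed to carry the same heat loads via direct-to-chip cold plates.
# Water properties and the 10 K coolant rise are illustrative assumptions.

WATER_CP = 4186.0      # J/(kg*K), specific heat of water
WATER_DENSITY = 997.0  # kg/m^3 at ~25 C

def water_lpm_for_load(power_w: float, delta_t_k: float = 10.0) -> float:
    """Liters per minute of water needed to remove power_w at a delta_t_k rise."""
    mass_flow = power_w / (WATER_CP * delta_t_k)      # kg/s, from Q = m * cp * dT
    return mass_flow / WATER_DENSITY * 1000.0 * 60.0  # kg/s -> L/min

print(f"10 kW server: {water_lpm_for_load(10_000):.1f} L/min")
print(f"120 kW rack:  {water_lpm_for_load(120_000):.1f} L/min")
```

About 14 L/min handles a 10 kW server, a flow achievable through small-bore tubing, versus hundreds of cubic meters of air per minute for the same load.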
Rear-door heat exchangers replace the rear doors of standard server racks with liquid-cooled coils, capturing heat from exhaust air before it enters the room. This is a less disruptive retrofit option for existing datacenters, but it is less effective at extreme power densities.
Immersion cooling submerges entire servers in tanks of dielectric fluid (non-conductive liquid). Single-phase immersion uses fluids that remain liquid, circulated through external heat exchangers. Two-phase immersion uses fluids that boil at low temperatures — the servers literally bubble as heat is absorbed through phase change, which is extraordinarily efficient. Companies like GRC, LiquidCool Solutions, and Submer specialize in immersion systems.
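The efficiency of two-phase immersion comes from latent heat: boiling absorbs far more energy per kilogram of fluid than simply warming it. The fluid properties below are rough figures for an engineered dielectric fluid (boiling point near 61 °C, latent heat near 112 kJ/kg); they are assumptions for illustration, not a specific product datasheet:

```python
# Sensible heating vs. phase change for a dielectric immersion fluid.
# All fluid properties here are assumed, order-of-magnitude values.

LATENT_HEAT = 112_000.0  # J/kg absorbed by boiling (phase change), assumed
FLUID_CP = 1183.0        # J/(kg*K), sensible specific heat, assumed

delta_t = 10.0  # K, a plausible single-phase temperature rise
sensible_j_per_kg = FLUID_CP * delta_t          # energy moved by warming the fluid
ratio = LATENT_HEAT / sensible_j_per_kg         # energy moved by boiling it instead
print(f"Boiling moves ~{ratio:.0f}x more energy per kg than a {delta_t:.0f} K rise")
```

Roughly an order of magnitude more heat per kilogram of circulated fluid, which is why the bubbling tanks can be so compact.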
The economics of liquid cooling are reshaping datacenter design. Liquid-cooled facilities can achieve higher compute density per square foot (reducing real estate costs), lower energy consumption for cooling (power usage effectiveness, or PUE, approaching 1.05 versus 1.3-1.6 for air-cooled facilities), and operate in warmer climates where air cooling is less effective. The waste heat from liquid-cooled systems can be recaptured for building heating, district heating, or industrial processes — turning the energy consumption problem into a resource.
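The PUE gap translates directly into megawatts. Since PUE is total facility power divided by IT power, the non-IT overhead is IT load × (PUE − 1). The 100 MW IT load and the specific PUE values below are hypothetical examples within the ranges cited above:

```python
# PUE arithmetic: facility overhead implied by a given PUE.
# The 100 MW IT load and sample PUE values are hypothetical.

def overhead_mw(it_load_mw: float, pue: float) -> float:
    """Non-IT power (cooling, distribution losses) in MW implied by a PUE."""
    return it_load_mw * (pue - 1.0)

it_mw = 100.0
for label, pue in (("air-cooled", 1.45), ("liquid-cooled", 1.05)):
    print(f"{label:>13}: PUE {pue} -> {overhead_mw(it_mw, pue):.0f} MW overhead")
```

At this scale the difference is tens of megawatts of continuous draw, enough to power the cooling advantage into the largest line item in the facility's operating budget.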
As AI accelerator power consumption continues to climb with each generation, liquid cooling is no longer optional for frontier AI training facilities. It's a structural requirement that affects every aspect of datacenter design, from plumbing to electrical distribution to site selection.