Liquid Cooling: Delivering on the Promise
The demand for more efficient liquid cooling in HPC data centers is being driven by a number of factors. The push to Exascale is a fundamental driver in the most extreme cases, where power and cooling loom large as barriers, but there is also a broader need to cool server racks of ever higher power and heat density. This is especially true in HPC, where racks drawing 30 kW or more are appearing and individual nodes must handle multiple 150-watt processors augmented with GPUs and coprocessors of 200 watts or more.
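As a rough illustration of how such nodes reach rack-level figures of that magnitude, the short sketch below uses assumed, representative node counts and component powers rather than figures from any specific system:

```python
# Illustrative rack power budget (assumed, representative figures only).
CPU_W = 150          # per-CPU power, as cited above
GPU_W = 200          # per-GPU/coprocessor power, as cited above
OTHER_W = 150        # memory, drives, NICs, fans, PSU losses per node (assumed)

cpus_per_node = 2    # assumed node configuration
gpus_per_node = 2
node_w = cpus_per_node * CPU_W + gpus_per_node * GPU_W + OTHER_W   # 850 W

nodes_per_rack = 36  # assumed dense packaging
rack_kw = nodes_per_rack * node_w / 1000.0

print(f"Per-node power: {node_w} W")
print(f"Rack power:     {rack_kw:.1f} kW")   # ~30.6 kW
```

Even with conservative assumptions, a rack of accelerated nodes lands in the 30 kW class that strains conventional air cooling.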
Of course, liquid cooling already plays a role in “air-cooled” data centers: liquid is used to carry the bulk heat expelled by air-cooled racks out of the data center itself. Traditionally, most data centers and HPC sites bring liquid into the computer room via Computer Room Air Conditioner (CRAC) or Computer Room Air Handler (CRAH) units and use it to cool the air in the room. CRAC units receive a liquid refrigerant, while CRAH units use chilled water as the coolant. The inefficiency is rooted in the long path heat must travel from the server into that liquid.
Beyond addressing the power and heat density barriers, liquid cooling done correctly avoids CapEx in two ways: increased rack density mitigates the need for physical data center expansion, and infrastructure investments such as chiller, HVAC and cooling plant build-outs are reduced. With cooling often consuming one third of data center energy, compelling OpEx benefits follow from reducing both overall data center cooling and server power consumption, and even from enabling energy recovery.
Approaches to Liquid Cooling
Liquid cooling approaches can be divided into two groups: general-purpose and closely coupled.
General-purpose solutions move the air-cooling unit closer to standard air-cooled servers in their racks. These approaches include rear-door heat exchangers, sealed racks and in-row coolers.
Rear-door, in-row and over-row liquid coolers reduce the cost of moving air by placing the air-cooling unit closer to the servers. Rear-door coolers, for example, replace the rear doors of server racks with a liquid-cooled heat exchanger that transfers server heat into liquid as hot air exits the servers. The servers remain air-cooled, and facility liquid must still be supplied at the same temperatures needed for CRAH units (below 65°F). That liquid then exits the data center at less than 80°F, too cool for useful energy recovery.
The issue with these general-purpose solutions is that expensive chillers are still required and server fans consume just as much energy as before. They address only part of the problem, providing bulk-heat room neutrality while still requiring cold water and continued investment in chiller infrastructure.
Closely coupled solutions, on the other hand, bring cooling liquid directly to the high-heat-flux components within servers, such as CPUs, GPUs and memory. By taking advantage of the superior heat transfer and heat capacity of liquids, they reduce, or even eliminate, the need for expensive chiller infrastructure and also cut the power consumed by server fans.
This approach dates to the mid-1980s, when liquid cooling was brought inside the supercomputer with systems like the Cray-2, and it began a resurgence over a decade ago with systems like IBM’s Power 575.
Closely coupled approaches of note today include Direct Touch, immersion and direct-to-chip cooling.
“Direct Touch” cooling replaces the air heat sinks in servers with ‘heat risers’ that carry heat to the skin of the server chassis, where cold plates between servers transfer it to a refrigerant so the heat can be removed from the building. This eliminates server fans and the need to move air around the data center for server cooling, but it still requires infrastructure to cool the refrigerant to below 61°F, and the cold plates reduce the capacity of a typical 42U rack to around 35 RU, which runs counter to HPC density trends. Partly for these reasons, the approach has not gained significant traction.
Immersion cooling solutions remove server heat by placing servers in tanks of dielectric fluid or by filling custom servers with dielectric fluid. Challenges with this approach include server maintenance, the modification of servers with non-standard parts, the presence of large quantities of oil-based coolant in the data center, and poor space utilization because the server “racks” lie horizontally.
Direct-to-chip liquid cooling systems, such as Asetek’s RackCDU D2C™ (Direct-to-Chip) hot-water cooling, bring cooling liquid directly to the high-heat-flux components within servers, such as CPUs, GPUs and memory. CPUs run quite hot (153°F to 185°F), and memory and GPUs hotter still; the cooling efficiency of water (roughly 4,000 times that of air) allows D2C to cool these components with hot water. Hot-water cooling in turn allows dry coolers, rather than expensive chillers, to cool the water returning from the servers, and removing CPU, GPU and memory heat with liquid also reduces the power required for server fans.
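The “4,000x” figure refers to heat absorbed per unit volume per degree; a back-of-the-envelope comparison using standard textbook property values (assumed here, not taken from the original text) shows roughly where it comes from:

```python
# Volumetric heat capacity comparison, water vs. air (approximate textbook values).
water_density = 1000.0      # kg/m^3
water_cp = 4186.0           # J/(kg*K)
air_density = 1.2           # kg/m^3 at roughly room temperature
air_cp = 1005.0             # J/(kg*K)

water_vol_cp = water_density * water_cp   # ~4.19e6 J/(m^3*K)
air_vol_cp = air_density * air_cp         # ~1.21e3 J/(m^3*K)

print(f"Water absorbs ~{water_vol_cp / air_vol_cp:.0f}x more heat "
      "per unit volume per degree than air")   # ~3,500x
```

Under these assumptions the ratio comes out around 3,500x, in line with the roughly 4,000x figure commonly cited.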
The RackCDU D2C solution combines an extension to a standard rack (the RackCDU) with direct-to-chip coolers (D2C) in each server. Because the RackCDU uses quick connects for each server, facilities teams can remove and replace servers just as they do today.
Far more efficient pumps replace fan energy in the data center and the server, and hot water eliminates the need to chill the coolant. D2C liquid cooling dramatically reduces chiller use, CRAH fan energy and server fan energy. In doing so it delivers IT equipment energy savings of up to 10%, cooling energy savings greater than 50% and rack density increases of 2.5x to 5x versus air-cooled data centers.
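To put those percentages in context, a simple model (with an assumed baseline in which cooling accounts for one third of total facility energy, consistent with the figure cited earlier) illustrates the facility-level effect:

```python
# Illustrative facility-level energy model (assumed baseline, not measured data).
it_kw = 100.0                      # baseline IT load (arbitrary units)
cooling_kw = 50.0                  # cooling = one third of the 150-unit total

baseline_total = it_kw + cooling_kw              # 150

# Apply the savings claimed for D2C liquid cooling.
it_after = it_kw * (1 - 0.10)                    # up to 10% IT energy savings
cooling_after = cooling_kw * (1 - 0.50)          # >50% cooling energy savings

new_total = it_after + cooling_after             # 115
print(f"Total facility energy reduction: {(1 - new_total / baseline_total):.0%}")  # ~23%
```

Under these assumptions the whole-facility savings land in the low twenties of percent, comparable in magnitude to the LBNL result cited below.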
RackCDU D2C uses a distributed pumping model: a combined cold plate and pump replaces the air heat sink on each CPU or GPU in the server, and each pump/cold plate has sufficient pumping power to cool the whole server, providing redundancy. Unlike centralized pumping systems that require high pressures, the system operates at very low pressure, making it inherently more reliable.
In addition, RackCDU includes a software suite that provides monitoring and alerts for temperatures, flow, pressures and leak detection, and that can report into data center management software.
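As a sketch of how such rack-level telemetry might be consumed downstream, the example below checks readings against alert thresholds; the field names, thresholds and units are hypothetical illustrations, not the actual RackCDU software interface:

```python
# Hypothetical consumer of rack-level liquid-cooling telemetry.
# Field names and thresholds are illustrative, not the actual RackCDU interface.
from dataclasses import dataclass

@dataclass
class RackCduReading:
    supply_temp_c: float     # facility water supply temperature
    return_temp_c: float     # facility water return temperature
    flow_lpm: float          # loop flow rate, litres per minute
    pressure_bar: float      # loop pressure
    leak_detected: bool

def check(reading: RackCduReading) -> list[str]:
    """Return alert messages for out-of-range conditions."""
    alerts = []
    if reading.leak_detected:
        alerts.append("LEAK detected in rack loop")
    if reading.flow_lpm < 10.0:
        alerts.append(f"Low flow: {reading.flow_lpm} L/min")
    if reading.return_temp_c - reading.supply_temp_c > 15.0:
        alerts.append("High delta-T: check server loops")
    if reading.pressure_bar > 2.0:
        alerts.append(f"Over-pressure: {reading.pressure_bar} bar")
    return alerts

# Example reading; in practice alerts would be forwarded to a DCIM suite.
print(check(RackCduReading(40.0, 52.0, 28.0, 1.1, False)))   # -> []
```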
Direct-to-Chip Liquid Cooling Adoption Increasing
Direct-to-chip hot-water cooling is showing significant momentum in usage models important to both HPC and commercial data centers:
Mississippi State University (MSU) installed a Cray CS300-LC supercomputing cluster that incorporates Asetek’s D2C. Key to the purchase decision was the ability to increase computing capacity without buying new chillers and related equipment, and to install more compute within a fixed CapEx budget.
Lawrence Berkeley National Laboratory (LBNL) has found that Asetek’s direct-to-chip cooling technology not only delivers cooling energy savings of over 50%, but also savings of 21% of total data center energy, benefiting OpEx.
At the University of Tromsø (UiT) in Norway, the Stallo HPC cluster is targeting 70% IT energy re-use and district heating for its campus north of the Arctic Circle.
Beyond HPC, highly virtualized applications are being implemented with Asetek’s D2C at the U.S. Army’s Sparkman Center Data Center. The goals of this installation include 60% cooling energy savings, 2.5x consolidation within the existing infrastructure and 40% waste-heat recovery.
Delivering on the Promise
For HPC, Liquid Cooling Done Right™ must address the power, cooling and density demands of ever-denser racks and looming Exascale systems. At the same time, much as in the commercial segment, it must address serviceability, monitoring and redundancy.
Asetek has paid careful attention to these factors in designing RackCDU D2C liquid cooling, and the growing adoption and momentum of direct-to-chip cooling reflects the applicability of Asetek’s approach to the needs of HPC and of data centers generally.