1. Chip to Chiller Experimental Cooling Failure Analysis of Data Centers: The Interaction Between IT and Facility
- Author
Russell Tipton, Bahgat Sammakia, Mark Seymour, Husam A. Alissa, David Mendo, Kourosh Nemati, Dustin W. Demetriou, and Ken Schneebeli
- Subjects
Chiller; Engineering; business.industry; 020209 energy; Airflow; Intelligent Platform Management Interface; 02 engineering and technology; 021001 nanoscience & nanotechnology; Industrial and Manufacturing Engineering; Electronic, Optical and Magnetic Materials; 0202 electrical engineering, electronic engineering, information engineering; Water cooling; Data analysis; System integration; Instrumentation (computer programming); Central processing unit; Electrical and Electronic Engineering; 0210 nano-technology; business; Simulation
- Abstract
Cooling failure in data centers (DCs) is a complex phenomenon because of the many interactions between the cooling infrastructure and the information technology (IT) equipment. To fully understand it, a system integration philosophy is vital to testing and to the design of experiments. In this paper, a facility-level DC cooling failure experiment is run and analyzed. An airside cooling failure is introduced to the facility at two different cooling set points, in both open and contained environments. Quantitative instrumentation includes pressure differentials, tile airflow, external contour and discrete air inlet temperature, intelligent platform management interface (IPMI), and cooling system data during failure recovery. Qualitative measurements include infrared imaging and airflow visualization via smoke trace. To our knowledge, this is the first experimental study in which an actual multi-aisle facility cooling failure is run with real IT (compute, network, and storage) load in the white space. This establishes a link between variations at the facility level and those at the central processing unit (CPU). The results show that, based on the external IT inlet temperature sensors, the containment configuration has a longer available uptime (AU) during failure. However, the IPMI data show the opposite. In fact, the available uptime indicated by internal IT analytics is significantly shorter than that indicated by the external sensors. The IT power, CPU temperature, and fan speed all reach higher values during the containment failure. This occurs because external impedances form instantaneously in the containment during failure, which renders the contained aisle less resilient than the open aisle. The tradeoffs between power usage effectiveness (PUE), operating expenses (OPEX), and AU are also explained.
- Published
- 2016