Thermal Modeling of Cluster Systems using Machine Learning (Summer Research 2022)

Data centers consume about 1% of worldwide electricity, and recent studies show that data center electricity consumption is still increasing. There is an urgent need to reduce the energy consumption of data centers, and workload management is one strategy that many data centers have adopted for this purpose. Energy modeling plays an important role in making workload-balancing decisions, and machine learning techniques have become increasingly popular in thermal and energy modeling [1].

In this research, we studied existing machine learning algorithms and methods used in thermal prediction [2-4] for data center and cluster systems and compared their performance. To investigate the impact of CPU activity on temperature and energy consumption, we designed a set of experiments and developed programs to extract temperature, energy, and performance data from the experimental results. We then applied several regression models to estimate temperature and energy consumption and compared the performance of these methods.

We designed a group of experiments to study the energy consumption of the key components in an HPE BladeSystem c7000 cluster. The cluster is composed of 8 HP ProLiant BL460c Gen8 servers and 8 graphics expansion blades. Each server has 2 Intel CPU chips with 10 cores per chip, and 2 Intel SSDs configured as RAID 1 for mirroring. Table 1 shows the configuration of the servers.

  • Node Configuration
    • CPU: 2 * Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz (10 Cores)
    • Disk: 2 * Intel SSD DC S3520 480 GB
    • Memory: 96 GB
    • GPU: Nvidia Tesla P4
  • Operating System: Ubuntu 21.10 (impish)
  • Blade Server: HP ProLiant BL460c Gen8 Blade Server
Table 1. Node Configuration of the HP Server
Figure 1. Photo of Icess working on the research project

Experiments and Results

Built-in temperature sensors were used to track the temperature of the air entering and exiting each computing node and the ambient temperature of the cluster room. Interior temperature sensors were used to measure the temperatures of the CPU, disk, memory, and GPU. We developed scripts to track the utilization and temperature of these key components (CPU, disk, memory, and GPU) in the cluster servers. Whetstone [5], a synthetic benchmark program, was used to generate CPU-intensive workloads for studying the thermal impact of the CPU.
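The tracking scripts themselves are not listed here; a minimal Python sketch of the kind of polling they perform could look like the following, assuming a Linux host that exposes sensors as sysfs thermal zones (server-class hardware may instead require IPMI or lm-sensors, and the paths shown are illustrative, not the actual ones used):

```python
import time

def parse_millideg(raw: str) -> float:
    """Convert a sysfs thermal reading (millidegrees Celsius) to degrees C."""
    return int(raw.strip()) / 1000.0

def sample_cpu_temps(zone_paths, interval_s=1.0, n_samples=3):
    """Poll each thermal-zone file (e.g. /sys/class/thermal/thermal_zone0/temp,
    an assumed path) and return a list of (timestamp, temperatures) rows."""
    rows = []
    for _ in range(n_samples):
        temps = []
        for path in zone_paths:
            with open(path) as f:
                temps.append(parse_millideg(f.read()))
        rows.append((time.time(), temps))
        time.sleep(interval_s)
    return rows
```

Logging rows like these at a fixed interval, alongside per-core utilization, yields the time series used to correlate workload with temperature.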

Experiment Set 1: increase the number of processes from 1 to 40, with each process launching a Whetstone program that drives one CPU core to full utilization. When the CPU cores were idle, the average temperatures of the 2 chips were between 22°C and 25°C. As shown in Figure 2, when all CPU cores ran at full utilization, the average CPU core temperature rose to nearly 90°C, exceeding the high-temperature threshold (86°C). When every core ran 2 Whetstone threads, CPU power increased by nearly 240 W. We analyzed the average CPU temperatures and applied regression models and an XGBoost [6] model to fit the CPU temperature. Figure 3 compares the real measurements with the predictions during the heat-up and cool-down phases of a single Whetstone run.
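The exact regression forms are not specified here; as one hedged illustration of the heat-up phase in Figure 3, a first-order thermal model T(t) = T_ss - (T_ss - T_amb) * exp(-t / tau) can be fitted by log-linear least squares. The sketch below uses synthetic data with assumed values (tau = 120 s, ambient 24°C, steady state 88°C), not the measured values:

```python
import math

def fit_first_order_tau(times, temps, t_amb, t_ss):
    """Estimate the thermal time constant tau of a first-order heat-up curve
    T(t) = T_ss - (T_ss - T_amb) * exp(-t / tau).  Log-linearising gives
    ln((T_ss - T) / (T_ss - T_amb)) = -t / tau, a one-parameter
    least-squares line through the origin."""
    ys = [math.log((t_ss - T) / (t_ss - t_amb)) for T in temps]
    slope = sum(t * y for t, y in zip(times, ys)) / sum(t * t for t in times)
    return -1.0 / slope

# Synthetic heat-up data (assumed, for illustration only)
tau, t_amb, t_ss = 120.0, 24.0, 88.0
times = [10.0 * i for i in range(1, 31)]
temps = [t_ss - (t_ss - t_amb) * math.exp(-t / tau) for t in times]
tau_hat = fit_first_order_tau(times, temps, t_amb, t_ss)
```

A tree-ensemble model such as XGBoost fits the same curve nonparametrically from features like core utilization and elapsed time, which is why it can track both phases without an explicit physical model.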

Figure 2. Temperature of CPU and power in one computer server.
Figure 3. Comparison of real measurements and estimations for CPU chip temperature.

Experiment Set 2: modify the Whetstone program so that the utilization of a single core can be set to different values through a passed-in parameter (the number of loops). We conducted 11 experiments with one process, increasing the single-core CPU utilization step by step. As shown in Figure 4, the average temperatures of the 2 CPU chips (which are not deployed next to each other) both increased as the number of loops increased, except in one experiment. Analyzing the utilization of all CPU cores, we observed that although only one thread ran the Whetstone benchmark, it was executed on different cores within that experiment; when the benchmark ran longer on the cores of one chip, the average temperature of that chip was higher. We repeated this set of experiments several times and made similar observations. Figure 5 shows that, compared with the idle state, CPU power increased by nearly 30 W when one Whetstone thread was running, driving a CPU core to 89% utilization on average.
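Our modification adjusted the Whetstone loop count in the benchmark's C source, but the general technique of driving a core toward a target utilization can be sketched as a duty-cycle load generator. This is a simplified Python stand-in, not the modified benchmark itself:

```python
import time

def run_duty_cycle_load(target_util, period_s=0.05, duration_s=1.0):
    """Approximate a target single-core utilization by duty-cycle throttling:
    within each short period, busy-spin for target_util of the period and
    sleep for the remainder.  Returns the number of completed periods."""
    busy_s = period_s * target_util
    periods = 0
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        spin_until = time.monotonic() + busy_s
        while time.monotonic() < spin_until:
            pass                      # busy work stands in for Whetstone loops
        time.sleep(max(0.0, period_s - busy_s))
        periods += 1
    return periods
```

Note that, as observed above, the OS scheduler may migrate such a thread between cores, so per-chip temperatures depend on where the load actually runs unless the process is pinned to a core.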

Figure 4. CPU Average Temperature
Figure 5. CPU Power Increment

Conclusion and Future Work

We conducted two sets of experiments to study average CPU temperatures with different numbers of cores running at the same utilization and with a fixed number of cores running at different utilizations. Applying regression models and the XGBoost machine learning model, we observed that the XGBoost model performs better at CPU temperature prediction. In the future, we will conduct more experiments with various numbers of CPU cores running at different utilizations, and we will compare other machine learning algorithms with the XGBoost model for CPU temperature prediction. We will also investigate the thermal behavior of other components (memory, disk, and GPU) and predict server temperature under different types of workloads.

References

[1] Jin, Chaoqiang, Xuelian Bai, Chao Yang, Wangxin Mao, and Xin Xu. "A review of power consumption models of servers in data centers." Applied Energy 265 (2020): 114806.

[2] Ilager, Shashikant, Kotagiri Ramamohanarao, and Rajkumar Buyya. "Thermal prediction for efficient energy management of clouds using machine learning." IEEE Transactions on Parallel and Distributed Systems 32, no. 5 (2020): 1044-1056.

[3] Athavale, Jayati, Yogendra Joshi, and Minami Yoda. "Artificial neural network based prediction of temperature and flow profile in data centers." In 2018 17th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), pp. 871-880. IEEE, 2018.

[4] Zhang, Kaicheng, Akhil Guliani, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Rajesh Sankaran, and Pete Beckman. "Machine learning-based temperature prediction for runtime thermal management across system components." IEEE Transactions on Parallel and Distributed Systems 29, no. 2 (2017): 405-419.

Acknowledgements

Supported by The Louis Stokes Alliance for Minority Participation--LSAMP Program and funded by the National Science Foundation and the California State University System.