Experiments and Results
Built-in temperature sensors were used to track the temperature of the air entering and exiting each computing node, as well as the ambient temperature in the cluster room. Internal temperature sensors were used to measure the temperatures of the CPU, disk, memory, and GPU. We developed scripts to track the utilization and temperature of these key components in the cluster servers. Whetstone[5], a synthetic benchmark program, was used to generate CPU-intensive workloads for studying the thermal impact of the CPU.
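As an illustration of the utilization-tracking part of such scripts, the following is a minimal sketch and not the scripts used in this work. It assumes a Linux server where per-core counters are available in the /proc/stat text format; the function names parse_proc_stat and utilization are hypothetical. Per-core utilization is computed from the difference between two snapshots.

```python
def parse_proc_stat(text):
    """Return {core_name: (busy_jiffies, total_jiffies)} from /proc/stat content."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines ("cpu0", "cpu1", ...); skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        counters[fields[0]] = (total - idle, total)
    return counters


def utilization(before, after):
    """Percent utilization per core between two parsed snapshots."""
    result = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before.get(name, (0, 0))
        delta_total = total1 - total0
        if delta_total > 0:
            result[name] = 100.0 * (busy1 - busy0) / delta_total
        else:
            result[name] = 0.0
    return result
```

In use, a script would read /proc/stat at a fixed interval, pass consecutive snapshots to utilization, and log the result next to the sensor readings taken at the same time.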
Experiment Set 1: We increased the number of processes from 1 to 40, with each process launching a Whetstone program that drives one CPU core to full utilization. When the CPU cores were idle, the average temperatures of the 2 chips were between 22°C and 25°C. As shown in Figure 2, when all CPU cores were running at full utilization, the average CPU core temperature rose to nearly 90°C, which exceeded the high temperature threshold (86°C). When every core was running 2 threads of Whetstone, the CPU power increased by nearly 240W. We analyzed the average CPU temperatures and applied regression models and an XGBoost[6] model to fit the CPU temperature. Figure 3 compares the real measurements with the predictions during the heat-up and cool-down phases of running one Whetstone benchmark.
Figure 2. CPU temperature and power in one server. | Figure 3. Comparison of real measurements and estimations for CPU chip temperature.
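The text does not specify which regression models were applied. As one possible illustration, the sketch below assumes a first-order thermal response, T(t) = T_steady + (T0 - T_steady) * exp(-t / tau), and fits it by linear least squares on ln|T - T_steady|. The functional form, the function names, and the use of RMSE as the error metric are all assumptions made for this example, not the method used in the experiments.

```python
import math


def fit_first_order(times, temps, t_steady):
    """Fit T(t) = t_steady + (T0 - t_steady) * exp(-t / tau); return (T0, tau).

    t_steady is the assumed steady-state temperature of the phase being fitted
    (the plateau for heat-up, the idle temperature for cool-down).
    """
    xs, ys = [], []
    for t, temp in zip(times, temps):
        gap = abs(temp - t_steady)
        if gap > 1e-9:  # the log is undefined once the sample reaches steady state
            xs.append(t)
            ys.append(math.log(gap))
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two samples away from steady state")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or sxy >= 0:
        raise ValueError("samples do not decay toward t_steady")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    tau = -1.0 / slope
    sign = 1.0 if temps[0] >= t_steady else -1.0
    return t_steady + sign * math.exp(intercept), tau


def predict(t, t0, tau, t_steady):
    """Temperature predicted by the fitted first-order model at time t."""
    return t_steady + (t0 - t_steady) * math.exp(-t / tau)


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted temperatures."""
    return math.sqrt(
        sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)
    )
```

A fit of this kind could serve as a simple baseline: the same rmse function can be applied to the predictions of any other model on the same heat-up or cool-down trace to compare them.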
Experiment Set 2: We modified the Whetstone program so that the utilization of a single core could be set to different values through a passed-in parameter (loops). We conducted 11 experiments with one process, increasing the single-core CPU utilization from one experiment to the next. As shown in Figure 4, the average temperatures of both CPU chips increased as the number of loops increased, except in one experiment. The two CPU chips were not deployed next to each other. We analyzed the utilization of all CPU cores and observed that, although only one thread was running the Whetstone benchmark, it was executed on different cores within one experiment. When the benchmark ran longer on the cores of one CPU chip, the average temperature of that chip was higher. We repeated this set of experiments several times and made similar observations. Figure 5 shows that, compared with the state in which all CPU cores were idle, CPU power increased by nearly 30W when one Whetstone thread was running and drove a CPU core to 89% utilization on average.
Figure 4. CPU Average Temperature | Figure 5. CPU Power Increment
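How the loops parameter maps to utilization is not detailed in the text. One common way to obtain a tunable single-core load is a duty cycle: run a fixed number of floating-point iterations, sleep for a fixed interval, and repeat, so that a larger loops value yields a higher busy fraction. The sketch below illustrates that idea only; float_kernel is a stand-in and not the Whetstone benchmark, and both function names are hypothetical.

```python
import math
import time


def float_kernel(loops):
    """Stand-in floating-point workload (not the real Whetstone kernel)."""
    x = 1.0
    for _ in range(loops):
        x = math.sqrt(x * x + 1.0) * 0.5 + math.sin(x)
    return x


def run_duty_cycle(loops, sleep_s, cycles):
    """Alternate `loops` iterations of work with a fixed sleep.

    Returns the measured busy fraction (time spent computing / wall time),
    which approximates the utilization of the core running this process.
    """
    busy = 0.0
    start = time.perf_counter()
    for _ in range(cycles):
        t0 = time.perf_counter()
        float_kernel(loops)
        busy += time.perf_counter() - t0
        time.sleep(sleep_s)
    elapsed = time.perf_counter() - start
    return busy / elapsed if elapsed > 0 else 0.0
```

Since the scheduler may move such a process between cores, as observed above, pinning it to one core (for example with os.sched_setaffinity on Linux) would be one way to keep the heat source on a single chip. Whether this changes the per-chip averages would need to be verified experimentally.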
Conclusion and Future Work
We conducted two sets of experiments to study the average CPU temperatures with different numbers of cores running at the same utilization and with a fixed number of cores running at different utilizations. After applying regression models and the XGBoost machine learning model, we observed that the XGBoost model performed better in predicting CPU temperature. In the future, we will conduct more experiments with various numbers of CPU cores running at different utilizations. Other machine learning algorithms will be compared with the XGBoost model for CPU temperature prediction. We will further investigate the thermal behaviors of other components (memory, disk, and GPU) and predict the server temperature under different types of workloads.
References
[1] Jin, Chaoqiang, Xuelian Bai, Chao Yang, Wangxin Mao, and Xin Xu. "A review of power consumption models of servers in data centers." Applied Energy 265 (2020): 114806.
[2] Ilager, Shashikant, Kotagiri Ramamohanarao, and Rajkumar Buyya. "Thermal prediction for efficient energy management of clouds using machine learning." IEEE Transactions on Parallel and Distributed Systems 32, no. 5 (2020): 1044-1056.
[3] Athavale, Jayati, Yogendra Joshi, and Minami Yoda. "Artificial neural network based prediction of temperature and flow profile in data centers." In 2018 17th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), pp. 871-880. IEEE, 2018.
[4] Zhang, Kaicheng, Akhil Guliani, Seda Ogrenci-Memik, Gokhan Memik, Kazutomo Yoshii, Rajesh Sankaran, and Pete Beckman. "Machine learning-based temperature prediction for runtime thermal management across system components." IEEE Transactions on Parallel and Distributed Systems 29, no. 2 (2017): 405-419.
Acknowledgements
This work was supported by the Louis Stokes Alliance for Minority Participation (LSAMP) Program and funded by the National Science Foundation and the California State University System.