Xunfei Jiang

Analysis the Design of Cluster Monitoring Systems (Summer Research 2022)

Student Participant: Carlos Arturo Morales Sierra

Cluster monitoring systems place an important role for cluster administrators to understand the status and performance of servers in cluster systems. Ganglia is a scalable distributed monitoring system for high-performance computing systems, and it has been used on thousands of clusters around the world [1]. It provides time-series views of tens of machine-related metrics such as CPU, memory, storage, network usage [2]. However, it does not provide customized support for users to monitor the temperature and energy consumption, and the data collection intervals could not be customized [3]. In addition, it does not support the monitoring of jobs or tasks which makes it hard for users to extract data for task management on cluster systems.

In this research, we conducted literature reviews to study existing cluster monitoring systems[1][4-6], and installed the Ganglia system to analyze the performance of the system and explore supporting metrics on this system.

We installed the Ganglia cluster monitoring system on a HPE BladeSystem C7000 cluster. This cluster is composed of 8* HP ProLiant BL460c Gen8 servers and 8* Graphic Expansion Blades). Each computer server has 2 Intel CPU chips with 10 cores in each chip, and 2 Intel SSD disks are configured as RAID 1 for mirror backup. The following shows the configuration of the computer servers.

Node Configuration CPU: 2 * Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz (10 Cores) Disk: 2 * Intel SSD DC S3520 480 GB Memory: 96 GB GPU: Nvidia Tesla P4 Operating System: Ubuntu 21.10 (impish) Blade Server: HP ProLiant BL460c Gen8 Blade Server
Table 1. Node Configuration of the HP Server

Analysis of Exisiting Cluster Monitoring Systems

In total six systems were researched, of which the focus was on Ganglia, PLMSys and the HPC system:

Ganglia [1]: Ganglia is a monitoring system for high performance computing, which was designed and implemented by students from the university of Berkeley California. Ganglia’s main functions are carried out by two types of daemons called gmond and gmetad. The gmond daemons are deployed in each of the computer systems being monitored to collecting metrics such as CPU usage, workload, network usage, memory usage, and more. Gmtad collects the metrics information from any node and saves it in a RRD database. PHP scripts then extract the information from the RRD and sent sent it to an apache server that converts the data into a webpage and makes the graphs dynamically for a user to interact with them.
PLMSys [2]: Its main job is to complement the existing monitoring systems as well as helping users and businesses monitor clusters and docker containers. It works via the following modules: The metrics acquisition module is installed on each host to gather data on its metrics as well as the metrics of the docker containers. This data ultimately ends in a monitoring database for storage. The container logging module retrieves the logs that docker stores from each container and organizes and sends them to the monitoring database for ‘centralized preservation’. The exception alerting module can be modified/set according to preference to alert for anomalies. The workload analysis module uses the time series model to provide descriptive statistics of the metrics retrieved from the containers and hosts. The PLMSys monitor system manages data transfer through the message querying system for increased efficiency and performance and uses OpenTSDB and MySQL to store performance metrics, data logs and platform settings.
HPC Monitoring System [3]: Four-tiered architecture, starts with the grid information detecting layer which obtains information from system’s daemons about memory usage, network traffic and more. Next in the Grid information Management layer, a shell program gathers all the data and passes it to a multicast channel that transmits periodic statuses of the nodes and keeps itself updated through listening threads. In the cluster information gateway layer, data is collected from nodes but combined in a “single cluster image” then pass back to layer 2. This layer also performs the function of transforming all data obtained into standard format that conforms to XLM format. Finally, in the cluster information interface layer users/administrators can engage with an easy-to-use user interface.

We installed the Ganglia system on our cluster to study what metrics are collected by this system, and what extension could be added for our research. Following is the steps to install the Ganglia system:

Install Apache web server on the server node
Install PHP modules on the server node
Install Ganglia gmond and gmetad daemons on the server node
Install Ganglia gmond daemon on the client nodes


Figure 1. The Ganglia System installed on our GreenComputing Cluster.	Figure 2. CPU and disk metrics visualized by Ganglia.


Figure 3. Load and memory metrics visualized by Ganglia.	Figure 4. Network and process metrics visualized by Ganglia.

Conclusion and Future Work

In conclusion, even though Ganglia is one of the oldest monitoring systems for clusters and grids, it is still a viable tool for cluster administrators. Through the research we saw that Ganglia has many custom metrics that can be added, it’s easy to install and its user interface is highly customizable and interactive. For example, we were able to add minutes to the ranges tab on the ganglia website by editing a PHP file. However, depending on the needs of the user/administrator it may be better to go with a newer monitoring system such as Munin or dataDog. Systems such as them are more adequate to handle containerized software which is highly used nowadays. As well as providing better security, alerting mechanisms, ease of access, creation of plug-ins and more.

Reference

[1] Massie, Matthew L., Brent N. Chun, and David E. Culler. "The ganglia distributed monitoring system: design, implementation, and experience." Parallel Computing 30, no. 7 (2004): 817-840.

[2] Massie, Matt, Bernard Li, Brad Nicholes, Vladimir Vuksan, Robert Alexander, Jeff Buchbinder, Frederiko Costa et al. Monitoring with Ganglia: tracking dynamic host and application metrics at scale. " O'Reilly Media, Inc.", 2012.

[3] Bhatnagar, Rakesh, and Jayesh Patel. "Performance Analysis of a Grid Monitoring System-Ganglia." International Journal of Emerging Technology and Advanced Engineering 3, no. 8 (2013): 362-365.

[4] Li, Min, and Yisheng Zhang. "Hpc cluster monitoring system architecture design and implementation." In 2009 Second International Conference on Intelligent Computation Technology and Automation, vol. 2, pp. 325-327. IEEE, 2009.

[5] Sun, Yongzhong, Kejiang Ye, and Cheng-Zhong Xu. "PLMSys: A Cloud Monitoring System Based on Cluster Performance and Container Logs." In International Conference on Cloud Computing, pp. 111-125. Springer, Cham, 2020.

Acknowledgements

Supported by The Louis Stokes Alliance for Minority Participation--LSAMP Program and funded by the National Science Foundation and the California State University System.