Cluster monitoring systems track the status of various components (e.g. memory, I/O) of cluster clients. Job management systems use them to decide on which clients jobs should run. Administrators use the information they provide to spot errors or bottlenecks in cluster operation and to collect statistics on cluster utilization.
The basic uses of a monitoring system are:
- troubleshooting and recovery of operations of system components
- performance analysis and enhancements of system operations
- support for the job scheduling component when it makes decisions
- support for the resource management component (e.g. when performing load balancing)
- collection of information on applications
- detection of security threats and "holes".
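The scheduling-support use listed above can be illustrated with a minimal sketch. It assumes the monitoring system reports a one-minute load average and free memory per client; the `ClientStatus` type, its field names and the selection rule are hypothetical and are not taken from any of the systems described here.

```python
from dataclasses import dataclass


@dataclass
class ClientStatus:
    # Hypothetical per-client record as a monitoring system might report it.
    name: str
    load_one: float   # one-minute load average
    mem_free_mb: int  # free memory in megabytes


def pick_client(statuses, mem_needed_mb):
    """Return the name of the least-loaded client with enough free memory, or None."""
    candidates = [s for s in statuses if s.mem_free_mb >= mem_needed_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.load_one).name


statuses = [
    ClientStatus("node01", 0.10, 512),
    ClientStatus("node02", 1.75, 4096),
    ClientStatus("node03", 0.40, 2048),
]
print(pick_client(statuses, 1024))  # node03
```

A real job management system would weigh many more parameters; the sketch only shows how monitoring data feeds a placement decision.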
Desirable features of a cluster monitoring system are the following:
- non-intrusiveness
- integration with job management systems
- graphic interface (standard GUI or web portal)
- system usage statistics retrieval
- availability in cluster distributions
- ability to link several clusters
- API.
The following cluster monitoring systems are described below: Ganglia, Clumon, Supermon and Hawkeye.
Ganglia
Ganglia is a system for monitoring a group of computers. Information on computer state is written in XML form, so data received from Ganglia can be used by other programs. Furthermore, Ganglia can be used for monitoring in a Grid, because several Ganglia clusters can be linked together.
Ganglia consists of three modules: gmond (Ganglia Monitoring Daemon), gmetad (Ganglia Meta Daemon) and the Ganglia web portal. Each client runs a gmond service, which collects data about the computer's components. The data is saved into an XML document and sent to a multicast address. The gmetad module runs on a chosen computer, and its task is to collect data from a chosen group of nodes. The gmetad service can be configured to send the data it has received to the multicast address and to collect data from other gmetad services. In that way, it is possible to monitor several clusters. The web portal is placed on the same computer as gmetad and lets users view graphs of the data collected by the gmond services.
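Because the data is XML, any program with an XML parser can consume it. The following is a minimal sketch of such a consumer. The sample document is simplified and hard-coded, and its element and attribute names (`CLUSTER`, `HOST`, `METRIC`, `NAME`, `VAL`) are modeled on Ganglia's output but should be treated as assumptions; the function name `metrics_by_host` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; real gmond output carries more attributes.
SAMPLE = """\
<GANGLIA_XML VERSION="3.0" SOURCE="gmond">
  <CLUSTER NAME="demo" OWNER="unknown">
    <HOST NAME="node01" IP="10.0.0.1">
      <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
      <METRIC NAME="mem_free" VAL="204800" TYPE="uint32" UNITS="KB"/>
    </HOST>
    <HOST NAME="node02" IP="10.0.0.2">
      <METRIC NAME="load_one" VAL="1.50" TYPE="float" UNITS=""/>
    </HOST>
  </CLUSTER>
</GANGLIA_XML>
"""


def metrics_by_host(xml_text):
    """Map each host name to a dict of metric name -> value (values kept as strings)."""
    root = ET.fromstring(xml_text)
    result = {}
    for host in root.iter("HOST"):
        result[host.get("NAME")] = {
            metric.get("NAME"): metric.get("VAL") for metric in host.iter("METRIC")
        }
    return result


report = metrics_by_host(SAMPLE)
print(report["node01"]["load_one"])  # 0.25
```

Values are left as strings here; a consumer could convert them using the `TYPE` attribute if needed.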
Advantages of Ganglia are as follows:
- intuitive interface in form of a web portal
- data shown in XML form
- scalability
- possibility of linking several clusters
- simple installation
- possibility of executing commands on the computers which are monitored
- Ganglia is a part of Rocks, OSCAR and Warewulf cluster distributions
- API.
Disadvantages of Ganglia are:
- writing data in XML form requires computationally demanding parsing, and it can cause significant network load.
Clumon
Clumon is a cluster monitoring system built on Silicon Graphics Performance Co-Pilot (PCP). Its basic characteristic is integration with the OpenPBS job management system. Clumon consists of PCP services, the clumond service, a relational database and a web portal. PCP services are installed on the clients, where they track the status of selected resources. The clumond service is installed on the server; it collects data from the PCP services and stores it in the database. Alongside the PCP data, Clumon collects data about jobs from the PBS server. The web portal is used to monitor client status, processes on clients and PBS-managed jobs. It also allows retrieval of detailed information about PBS jobs and a graphical display of job schedules by client.
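The collect-and-store role of the server-side service can be sketched as follows. This is only an illustration of the pattern, not Clumon's actual schema or code: the table layout and function names are hypothetical, SQLite stands in for the relational database, and the readings are hard-coded in place of values that would come from a PCP service.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) samples table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " host TEXT NOT NULL, metric TEXT NOT NULL,"
        " value REAL NOT NULL, taken_at INTEGER NOT NULL)"
    )
    return conn


def store_samples(conn, host, taken_at, readings):
    """Insert one row per metric reading collected from a client."""
    conn.executemany(
        "INSERT INTO samples (host, metric, value, taken_at) VALUES (?, ?, ?, ?)",
        [(host, metric, value, taken_at) for metric, value in readings.items()],
    )
    conn.commit()


def latest(conn, metric):
    """Return the most recent value of a metric for every host that reported it."""
    rows = conn.execute(
        "SELECT s.host, s.value FROM samples s"
        " WHERE s.metric = ? AND s.taken_at = ("
        "   SELECT MAX(taken_at) FROM samples"
        "   WHERE host = s.host AND metric = s.metric)"
        " ORDER BY s.host",
        (metric,),
    ).fetchall()
    return dict(rows)


conn = open_store()
store_samples(conn, "node01", 100, {"load": 0.2, "mem_free": 512.0})
store_samples(conn, "node01", 160, {"load": 0.9})
store_samples(conn, "node02", 150, {"load": 0.4})
print(latest(conn, "load"))  # {'node01': 0.9, 'node02': 0.4}
```

A web portal would then read from the same database, which is what decouples data collection from display.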
Advantages of Clumon are:
- integration with PBS job management system
- display of processes at certain clients
- possibility of installing arbitrary sensors
- API.
Disadvantages of Clumon are:
- it is not possible to link several clusters
- graphical display only for the load parameter
- non-portable implementation.
Supermon
Supermon is a system for monitoring cluster clients whose basic feature is that it places an insignificant load on the clients. It uses S-expressions to describe the status of client components. S-expressions are constructs derived from the Lisp programming language. Supermon is implemented as an operating system kernel module, which enables fast operation. It is currently in the development phase.
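To show what consuming S-expressions involves, here is a minimal sketch of a reader that turns an S-expression into nested Python lists. The sample string and its field names are invented for illustration and do not reproduce Supermon's actual output format.

```python
def tokenize(text):
    """Split an S-expression into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one expression from the front of the token list (consumed in place).

    Assumes balanced parentheses; malformed input raises IndexError.
    """
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return items
    # Atoms: try integer, then float, otherwise keep the symbol as a string.
    try:
        return int(token)
    except ValueError:
        try:
            return float(token)
        except ValueError:
            return token


# Hypothetical status report for one client.
SAMPLE = "(node01 (cpu (user 1200) (system 300)) (mem (free 204800)))"
tree = parse(tokenize(SAMPLE))
print(tree)
# ['node01', ['cpu', ['user', 1200], ['system', 300]], ['mem', ['free', 204800]]]
```

The reader fits in a few lines, which illustrates why the format is cheap to process even though it is not a widely adopted standard for monitoring data.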
Advantages of Supermon are:
- insignificant load, because it is implemented as an OS kernel module
- possibility of linking several clusters
- possibility of installation of arbitrary sensors
- part of Clustermatic cluster distribution.
Disadvantages of Supermon are:
- use of a non-standard format for data display
- it is not offered with well-known cluster distributions
- scanty documentation
- the C version uses RPC, which makes the system non-portable.
Hawkeye
Hawkeye is a cluster monitoring system developed as a part of the Condor job management system. Its main purpose is monitoring Condor clusters. Hawkeye consists of a Monitoring Agent and a Monitoring Manager. The Monitoring Agent is placed on the clients of the Condor cluster. It collects data on client status and sends it to the Monitoring Manager using ClassAds, the same description format as other Condor services. The Monitoring Manager uses a database to store data about individual computers. Users access the data through a command-line interface, a graphical interface and a web portal.
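A ClassAd is essentially a list of attribute/expression pairs. The sketch below reads a simplified, hand-written ad in `Name = value` form into a dictionary. It handles only quoted strings and numbers and leaves any other expression as raw text; it is not a full ClassAd evaluator, and the sample attributes and function names are chosen for illustration only.

```python
def parse_value(raw):
    """Convert a raw right-hand side to str, int or float where possible."""
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return raw[1:-1]
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw  # unevaluated expression, kept as text


def parse_classad(text):
    """Parse 'Name = value' lines into a dict, skipping blank or malformed lines."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, _, raw = line.partition("=")
        ad[name.strip()] = parse_value(raw.strip())
    return ad


# Illustrative ad describing one monitored client.
SAMPLE = '''
Machine = "node01.example.org"
LoadAvg = 0.25
Memory = 512
Busy = LoadAvg > 1.0
'''

ad = parse_classad(SAMPLE)
print(ad["Machine"], ad["LoadAvg"], ad["Memory"])  # node01.example.org 0.25 512
```

Sharing one description format between the monitoring and job management components means the same matching machinery can work on monitoring data.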
Advantages of Hawkeye are:
- integration with Condor job management system
- possibility of installing arbitrary sensors
- possibility of performing commands at the computers that are being monitored
- it supports the same platforms as Condor
- API.
Disadvantages of Hawkeye are:
- modest interface for display of statistics.