[slurm-users] Setting-up Monitoring System on Cluster : Using Nagios and Librenms

Djamil Lakhdar-Hamina dl2774 at columbia.edu
Wed Jun 22 14:04:06 UTC 2022

I am helping set up a 16-node compute cluster. I am not a system
administrator, but I work for a small firm and have to pick up needed
skills quickly in areas where I have little experience. I am running
Rocky Linux 8 on Intel Xeon Phi (Knights Landing) nodes. We want to
monitor the controller and compute nodes, tracking for instance CPU
usage, storage, networking, and *very importantly* power and energy (we
are operating in Uganda and energy is very expensive).

As part of testing, I have set up a virtual dev environment using Vagrant (
https://github.com/Djamil17/openhpc-test-cluster). I have installed and
set up Nagios as specified by the Rocky Linux Slurm installation guide and
have been adjusting the config files with little success. According to
systemctl, httpd and nagios are both running on the controller, and nrpe
is running on the compute nodes. However, when I try to reach the web
interface via the controller's private IP, the page never loads and I am
not even prompted for a username and password. When I enter, say,
192.168.x.x/nagios, the browser (Chrome) just hangs.
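
One way to narrow down a hang like this is to check, from the machine
running the browser, whether the controller's web port is reachable at
all. The following is only a minimal sketch under stated assumptions:
probe_tcp is a hypothetical helper name, and ports 80 and 443 assume
Apache is listening on its default ports. A "timeout" result usually
suggests packets are being dropped (for example by a firewall, or because
the Vagrant private network is not routed from the host), while "refused"
suggests the host is reachable but nothing is listening on that port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt.

    Returns one of: 'open', 'refused', 'timeout', 'error'.
    (Hypothetical helper for illustration; not part of Nagios.)
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host answered, but no service is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout: often a dropped packet or no route.
        return "timeout"
    except OSError:
        # Anything else, e.g. "no route to host" or a name lookup failure.
        return "error"
```

For example, calling probe_tcp("192.168.x.x", 80) with the controller's
actual private address filled in, first from the host machine and then
from a compute node, would show whether the problem is network
reachability or the web server configuration itself.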

I have tried debugging with the help of online troubleshooting guides, to
little or no avail. I have checked every step those guides list: the
firewall, SELinux, that Apache is running (though I don't know much about
Apache), and that MySQL is running. How does the config.cfg file need to
be configured? What about compute.cfg? Does anyone have recommendations on
which services to track with Nagios and LibreNMS? Is Nagios even the right
route, or are there better solutions? I was wondering if anyone with
extensive experience with Nagios and LibreNMS in the context of Slurm would
be willing to work with me to set up the system.

Djamil Lakhdar-Hamina
