[slurm-users] 4 sockets but "

Loris Bennett loris.bennett at fu-berlin.de
Fri Jul 23 08:26:48 UTC 2021

Hi Ole,

Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:

> Hi Loris,
> On 7/23/21 9:05 AM, Loris Bennett wrote:
>> We use both Zabbix and pestat.  Zabbix gives us general information on
>> the state of the nodes and file systems, and we have added some Slurm
>> metrics, such as number of jobs pending, amount of memory pending,
>> number of GPUs pending, etc.  This has been quite handy, although I find
>> Zabbix a bit tricky to configure.  This may be because (a) we are stuck
>> on Version 3.4 due to the PHP dependency with CentOS 7 and (b) I only do
>> stuff very irregularly with Zabbix and so always have to start somewhat
>> from scratch.
> I prefer simple tools, if possible :-)  For monitoring Slurm compute nodes, I'm
> fully satisfied with the LBNL Node Health Check tools.  This offers checks of
> disk space, memory, GPUs, Infiniband and much more.  See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
> For monitoring the Slurm queue and pending jobs, I use the "showuserjobs" script
> from https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserjobs

For me the main benefit of tools like Zabbix is the historical
information they provide.  If I see that the nodes are all up, but the
number of cores in use is low, then that is often an indication that
people are overestimating their memory requirements.  Furthermore, if I
see the number of jobs waiting for GPUs or memory increasing over time,
then this might help me in deciding on some of the characteristics of
the next cluster.  However, such tools tend not to be simple :-(
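
As a rough illustration of the kind of pending-job metrics mentioned
above (jobs, memory and GPUs pending), here is a minimal Python sketch
that could feed a monitoring system.  It is not our actual Zabbix
setup.  It assumes its input lines look like "jobid|memory|gres", for
example from something like `squeue -h -t PENDING -o "%i|%m|%b"`; the
format specifiers and the exact GRES syntax should be checked against
the squeue man page for your Slurm version.  The function names are
made up for this example.

```python
import re

# Conversion factors to megabytes; a bare number is assumed to be MB.
_UNITS = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def mem_to_mb(text):
    """Convert a memory string such as '4000M' or '16G' to megabytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised memory value: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNITS[unit or "M"]


def gpus_in_gres(text):
    """Sum the GPU counts in a GRES string such as 'gpu:2' or 'gpu:a100:4'."""
    total = 0
    for item in text.split(","):
        parts = item.strip().split(":")
        # Some Slurm versions prefix the entry with 'gres' (assumption).
        if parts and parts[0] == "gres":
            parts = parts[1:]
        if not parts or parts[0] != "gpu":
            continue  # not a GPU request, e.g. 'N/A' or another GRES type
        # A trailing number is the count; a bare 'gpu' counts as one.
        total += int(parts[-1]) if parts[-1].isdigit() else 1
    return total


def summarise_pending(lines):
    """Aggregate pending jobs from 'jobid|memory|gres' lines (assumed format)."""
    summary = {"jobs": 0, "mem_mb": 0.0, "gpus": 0}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _jobid, mem, gres = line.split("|")
        summary["jobs"] += 1
        summary["mem_mb"] += mem_to_mb(mem)
        summary["gpus"] += gpus_in_gres(gres)
    return summary


if __name__ == "__main__":
    sample = ["101|4000M|gpu:2", "102|16G|N/A", "103|2G|gpu:a100:1"]
    print(summarise_pending(sample))
```

Note that simply summing the memory field ignores the number of nodes
and whether the request is per node or per CPU, so the total is only a
rough indicator of a trend, not an exact figure.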

>> pestat on the other hand gives us more information about what individual
>> jobs on individual nodes are up to at a given point in time.  I don't
>> quite see how one could integrate pestat itself directly into Zabbix, as
>> it is more geared to producing a report, but maybe Ole has ideas :-)
> Sorry, no ideas because I'm not familiar with Zabbix.

My bad.  Somehow the quoting got broken in the posting I was replying to
which made it look like you were using Zabbix, but in fact it was Diego.

Dr. Loris Bennett (Hr./Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de
