[slurm-users] 4 sockets but "
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 23 07:28:32 UTC 2021
Hi Loris,
On 7/23/21 9:05 AM, Loris Bennett wrote:
> We use both Zabbix and pestat. Zabbix gives us general information on
> the state of the nodes and file systems, and we have added some Slurm
> metrics, such as number of jobs pending, amount of memory pending,
> number of GPUs pending, etc. This has been quite handy, although I find
> Zabbix a bit tricky to configure. This may be because (a) we are stuck
> on Version 3.4 due to the PHP dependency with CentOS 7 and (b) I only do
> stuff very irregularly with Zabbix and so always have to start somewhat
> from scratch.
I prefer simple tools, if possible :-) For monitoring Slurm compute
nodes, I'm fully satisfied with the LBNL Node Health Check tool. It
offers checks of disk space, memory, GPUs, InfiniBand and much more. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-health-check
For monitoring the Slurm queue and pending jobs, I use the "showuserjobs"
script from
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserjobs
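As a rough illustration of the kind of per-user pending-job summary
discussed here (this is not the showuserjobs script itself, just a
minimal sketch), one could count pending jobs from squeue output in
Python. It assumes squeue is on the PATH; the function names
count_pending_by_user, fetch_pending_users and report are hypothetical.

```python
import subprocess
from collections import Counter


def count_pending_by_user(squeue_output: str) -> Counter:
    """Count jobs per user, given one user name per line.

    The input is expected to look like the output of
    `squeue -h -t PENDING -o %u` (no header, user name only).
    Blank lines are ignored.
    """
    users = (line.strip() for line in squeue_output.splitlines())
    return Counter(user for user in users if user)


def fetch_pending_users() -> str:
    """Run squeue and return its raw output (assumes squeue is on PATH)."""
    result = subprocess.run(
        ["squeue", "-h", "-t", "PENDING", "-o", "%u"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def report() -> None:
    """Print pending-job counts per user, largest first, plus a total."""
    counts = count_pending_by_user(fetch_pending_users())
    for user, njobs in counts.most_common():
        print(f"{user:<12} {njobs}")
    print(f"{'total':<12} {sum(counts.values())}")
```

Calling report() on a login node would print one line per user. The
parsing is kept separate from the squeue call so that it can be checked
against a sample string without a running cluster. Note that squeue may
show a pending job array as a single line, so the numbers are counts of
squeue lines rather than of individual array tasks.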
> pestat on the other hand gives us more information about what individual
> jobs on individual nodes are up to at a given point in time. I don't
> quite see how one could integrate pestat itself directly into Zabbix, as
> it is more geared to producing a report, but maybe Ole has ideas :-)
Sorry, no ideas because I'm not familiar with Zabbix.
/Ole