[slurm-users] Best method to determine if a node is down
dniven at ucsc.edu
Sat Jun 26 17:10:39 UTC 2021
I’d like to set up an email notification, perhaps via cron (unless there’s a better method), to alert the sysadmin when a Slurm node is down and/or not firing off jobs.
For example, in the NODELIST(REASON) column of ‘squeue’ I recently saw:
(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
And using ‘sinfo’ I saw:
% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST  NODES  PARTITION  STATE     CPUS  S:C:T   MEMORY  TMP_DISK  WEIGHT  AVAIL_FE  REASON
trom      1      short*     draining  112   2:56:2  204800  0         1       (null)    Kill task failed
trom      1      long       draining  112   2:56:2  204800  0         1       (null)    Kill task failed
I’m not sure what the best value to grep for would be, as I suspect there are states other than DOWN or DRAINED (such as the ‘draining’ shown in the sinfo output above) that mean a node is not firing off jobs.
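To make the question concrete, here is a rough sketch of the kind of check I have in mind, in Python rather than plain grep. It assumes the output of `sinfo -N -h -o "%N|%T|%E"` (node name, state, reason, separated by `|`). The set of "bad" states, the treatment of a `*` suffix as "not responding", and the function names are my own assumptions, not anything official, so please correct me if the list is wrong or incomplete.

```python
import subprocess

# Assumption: states that mean a node will not start new jobs.
# This list is a guess and should be adjusted per site.
BAD_STATES = {"down", "drained", "draining", "fail", "failing", "unknown"}

# Characters sinfo may append to a state name as flags (assumed set).
STATE_FLAGS = "*~#$@+!%"


def find_problem_nodes(sinfo_text):
    """Return {node: (state, reason)} for nodes in a bad state.

    Expects lines of the form "node|state|reason". A node listed once
    per partition (like 'trom' above) is reported only once.
    """
    problems = {}
    for line in sinfo_text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        node, state, reason = parts
        state = state.lower()
        base = state.rstrip(STATE_FLAGS)
        # Assumption: a '*' flag marks a node that is not responding.
        if base in BAD_STATES or "*" in state:
            problems[node] = (state, reason)
    return problems


def query_sinfo():
    """Run sinfo and return its output as text (requires Slurm installed)."""
    result = subprocess.run(
        ["sinfo", "-N", "-h", "-o", "%N|%T|%E"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report():
    """Print one line per problem node; print nothing if all is well."""
    for node, (state, reason) in sorted(find_problem_nodes(query_sinfo()).items()):
        print(f"{node}: {state} ({reason})")
```

Calling `report()` at the end of a script run from cron would produce output only when something is wrong, so if cron is configured to mail job output (for example via MAILTO), a message would go out only in that case.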
Thanks in advance for your ideas,