[slurm-users] Best method to determine if a node is down

Doug Niven dniven at ucsc.edu
Sat Jun 26 17:10:39 UTC 2021


Hi Folks,

I’d like to set up an email notification, perhaps via cron (unless there’s a better method), to alert the sysadmin when a Slurm node is down and/or not firing off jobs.

For example, in the NODELIST(REASON) column of ‘squeue’ I recently saw:

(Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)

And using ‘sinfo’ I saw:

% sinfo -Nl
Fri May 07 08:49:26 2021
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON              
trom         1    short*    draining 112    2:56:2 204800        0      1   (null) Kill task failed    
trom         1      long    draining 112    2:56:2 204800        0      1   (null) Kill task failed    

I’m not sure what the best value to grep for would be, as I suspect there are states other than DOWN or DRAINED (such as the “draining” state shown above) that also mean a node is down and not firing off jobs.
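
To make the question concrete, here is a rough sketch of the kind of filter I have in mind. It is a starting point under stated assumptions, not a tested solution. The function name filter_bad_nodes and the list of state prefixes (down, drain, fail, unknown) are my own guesses at what counts as "not firing off jobs", and may well be incomplete. It expects one "NODE STATE REASON" line per node on stdin, which I believe 'sinfo -N -h -o "%N %T %E"' produces.

```shell
#!/bin/sh
# filter_bad_nodes (hypothetical helper): read "NODE STATE REASON..." lines
# on stdin and print the ones whose state suggests the node is not
# accepting new jobs. The prefix list below is an assumption, not an
# exhaustive list of Slurm node states.
filter_bad_nodes() {
    awk '
        {
            state = tolower($2)
            # A trailing "*" in the state marks a node that is not responding.
            # "drain" also covers "drained" and "draining"; "fail" covers "failing".
            if (state ~ /\*$/ || state ~ /^(down|drain|fail|unknown)/)
                print
        }
    ' | sort -u   # the same node is listed once per partition, so de-duplicate
}

# Possible cron usage (assumes a working "mail" command; the address is a
# placeholder):
#   report=$(sinfo -N -h -o '%N %T %E' | filter_bad_nodes)
#   [ -n "$report" ] && printf '%s\n' "$report" | mail -s 'Slurm node problem' sysadmin@example.com
```

With the output above, this would report trom once, with its "Kill task failed" reason. I also see that 'sinfo -R' lists the reasons for nodes in the down, drained, fail or failing states, which might be simpler if those states turn out to be enough.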

Thanks in advance for your ideas,

Doug
