[slurm-users] Best method to determine if a node is down

Marcus Boden mboden at gwdg.de
Sun Jun 27 20:02:14 UTC 2021

Hi Doug,

Slurm has the strigger[1] mechanism, which can do exactly that; the
man page even has your use case as an example. It works quite well for us.


[1] https://slurm.schedmd.com/strigger.html
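
For illustration, here is a rough sketch of what such a trigger program could look like, loosely modelled on the example in the strigger man page. The script path and the mail address are placeholders I made up, and it assumes a working "mail" command on the slurmctld host, so treat it as a starting point rather than a tested setup.

```shell
#!/bin/bash
# Sketch of a node-down trigger program (assumptions: SCRIPT_PATH and
# ADMIN_EMAIL are hypothetical placeholders, "mail" is installed).
SCRIPT_PATH="/usr/local/sbin/slurm_node_down_notify"
ADMIN_EMAIL="sysadmin@example.com"

# A trigger is purged after it fires unless it was set with --flags=PERM,
# so register the next one before sending the notification.
rearm_trigger() {
    strigger --set --node --down --program="$SCRIPT_PATH"
}

# Slurm passes the affected node names to the program as arguments.
notify_body() {
    printf 'Slurm reported these nodes as DOWN: %s\n' "$*"
}

# Only act when called with a node list, i.e. when fired by a trigger.
if [ "$#" -gt 0 ]; then
    rearm_trigger
    notify_body "$@" | mail -s "Slurm nodes down" "$ADMIN_EMAIL"
fi
```

The first trigger would be registered once by hand with the same "strigger --set --node --down --program=..." line, and "strigger --get" should then list it. strigger also has a --drained event, which may be closer to the "draining" case in the quoted sinfo output.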

On 26.06.21 19:10, Doug Niven wrote:
> Hi Folks,
> I’d like to set up an email notification, perhaps via cron (unless there’s a better method), to notify the sysadmin when a Slurm node is down and/or not firing off jobs...
> For example, using ‘squeue’ in NODELIST(REASON) I recently saw:
> (Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions)
> And using ‘sinfo’ I saw:
> % sinfo -Nl
> Fri May 07 08:49:26 2021
> trom         1    short*    draining 112    2:56:2 204800        0      1   (null) Kill task failed
> trom         1      long    draining 112    2:56:2 204800        0      1   (null) Kill task failed
> I’m not sure what would be the best value to grep for, as I suspect there are other states than DOWN or DRAINED that might mean a node is down and not firing off jobs?
> Thanks in advance for your ideas,
> Doug
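
If you would still rather poll from cron, a filter along these lines might serve as a starting point. It is only a sketch: the function name is made up, and the list of state prefixes is my assumption about which states mean "not accepting new jobs", so adjust it to your site. It expects "node state" pairs on stdin and also flags states ending in "*", which sinfo uses for nodes that are not responding.

```shell
# Hypothetical helper: reads "NODE STATE" lines on stdin and prints the
# nodes whose state suggests they will not start new jobs.
# The state list below is an assumption, not an exhaustive set.
unavailable_nodes() {
    awk '$2 ~ /^(down|drain|fail|maint|unknown)/ || $2 ~ /\*$/ { print $1 " " $2 }' |
        sort -u   # sinfo -N repeats a node once per partition
}

# Possible use from cron (prints nothing when all nodes look fine):
#   sinfo -N -h -o '%N %T' | unavailable_nodes
```

The prefix match on "drain" covers both "draining" and "drained", so the "trom" node from the output above would be reported once.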

Marcus Vincent Boden, M.Sc.
Arbeitsgruppe eScience, HPC-Team
Tel.:   +49 (0)551 201-2191, E-Mail: mboden at gwdg.de
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Am Faßberg 11, 37077 Göttingen, URL: https://www.gwdg.de

