[slurm-users] Lots of jobs failed due to node failure

肖正刚 guru.novice at gmail.com
Wed Jul 22 00:40:56 UTC 2020

We run Slurm 19.05 on a cluster of about 1,000 nodes. Recently, we found that
many jobs failed due to node failure. Checking slurmctld.log, we found that
nodes are set to the DOWN state and then quickly resume responding.
Some log info:
[2020-07-20T00:21:23.306] error: Nodes j[1608,1802] not responding
[2020-07-20T00:22:27.486] error: Nodes j1608 not responding, setting DOWN
[2020-07-20T00:26:23.725] error: Nodes j1802 not responding
[2020-07-20T00:26:27.323] error: Nodes j1802 not responding, setting DOWN
[2020-07-20T00:26:46.602] Node j1608 now responding
[2020-07-20T00:26:49.449] Node j1802 now responding

Has anyone hit this issue before?
Any suggestions would help.
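
In case it helps with triage, here is a minimal Python sketch (not part of Slurm) for measuring how long each node stayed DOWN, based only on the message formats in the excerpt above. The function name down_intervals and the regular expression are my own assumptions. It handles single node names only and skips bracketed lists such as j[1608,1802], which this excerpt shows only on plain "not responding" lines.

```python
import re
from datetime import datetime

# Matches only the two message shapes seen in the excerpt above:
#   "error: Nodes <node> not responding, setting DOWN"
#   "Node <node> now responding"
# Plain "not responding" lines (without "setting DOWN") are ignored.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?:error: )?Nodes? (?P<node>\S+) "
    r"(?P<msg>not responding, setting DOWN|now responding)$"
)


def down_intervals(lines):
    """Return a list of (node, seconds_down) pairs, in order of recovery."""
    down_since = {}  # node name -> time it was set DOWN
    intervals = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
        node = m.group("node")
        if m.group("msg") == "now responding":
            # Only count a recovery if we saw the matching DOWN event.
            if node in down_since:
                delta = ts - down_since.pop(node)
                intervals.append((node, delta.total_seconds()))
        else:
            down_since[node] = ts
    return intervals
```

Calling down_intervals(open("slurmctld.log")) on the full log (the path is a placeholder) would show whether the outages cluster around particular nodes or times. For the lines quoted above, j1608 was DOWN for a little over four minutes and j1802 for about 22 seconds.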
