[slurm-users] Intermittent "Not responding" status

Stradling, Alden Reid (ars9ac) ars9ac at virginia.edu
Mon Dec 4 11:57:59 MST 2017


Since our transition to CentOS 7.3/SLURM 17.02, a number of my nodes have begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects no problems, and the nodes are perfectly healthy once I set their state back to "idle". The slurmd daemon keeps running uninterrupted throughout, and the nodes receive jobs immediately after going back online.

Has anyone on this list seen similar behavior? I have increased logging to debug/verbose, but have seen no errors worth noting.
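
For anyone trying to catch these events as they happen, here is a minimal sketch (not something we run in production) that lists nodes currently flagged as not responding. It assumes that `sinfo -h -N -o "%N %T"` prints one "node state" pair per line and that a trailing `*` on the state marks a non-responding node, as described in the sinfo man page. The function names and the sample node names are hypothetical.

```python
import subprocess
from typing import List


def not_responding_nodes(sinfo_output: str) -> List[str]:
    """Return node names whose state ends in '*' (the not-responding flag).

    Expects text in the form produced by: sinfo -h -N -o "%N %T"
    With -N a node can be listed once per partition, so duplicates are dropped.
    """
    flagged: List[str] = []
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        node, state = parts
        if state.endswith("*") and node not in flagged:
            flagged.append(node)
    return flagged


def fetch_sinfo_output() -> str:
    """Run sinfo and return its stdout (assumes sinfo is on PATH)."""
    result = subprocess.run(
        ["sinfo", "-h", "-N", "-o", "%N %T"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `not_responding_nodes(fetch_sinfo_output())` from cron at a shorter interval than the health check would at least timestamp when a node gets flagged, which can then be lined up against the slurmctld and slurmd logs.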

Cheers,

Alden
