[slurm-users] Intermittent "Not responding" status

Mon Dec 4 12:04:21 MST 2017

I've seen this happen when there are internode communication issues 
that disrupt the tree Slurm uses to talk to the nodes and do 
heartbeats.  We have this happen occasionally in our environment, as we 
have nodes in two geographically separate facilities and the 
latency is substantial, so the lag crossing back and forth can add up. I 
would check whether all your nodes can talk to each other and to the 
master, and whether your Timeouts are set high enough.
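
For the timeout half of that check, here is a rough sketch of how one 
might pull the relevant values out of `scontrol show config`. It assumes 
the usual "Key = value" layout of that output. The choice of 
SlurmdTimeout, MessageTimeout and TreeWidth as the settings to look at 
is my assumption about what matters here, and the helper names are made 
up for illustration.

```python
import re
import subprocess

# Settings assumed to be relevant to node heartbeat and message fan-out.
KEYS = ("SlurmdTimeout", "MessageTimeout", "TreeWidth")


def parse_settings(config_text, keys=KEYS):
    """Return {key: value} for the requested keys found in 'Key = value' lines."""
    found = {}
    for line in config_text.splitlines():
        match = re.match(r"\s*(\w+)\s*=\s*(.*\S)", line)
        if match and match.group(1) in keys:
            found[match.group(1)] = match.group(2)
    return found


def current_settings():
    """Run 'scontrol show config' on this host and extract the settings above."""
    result = subprocess.run(
        ["scontrol", "show", "config"],
        capture_output=True, text=True, check=True,
    )
    return parse_settings(result.stdout)
```

Calling `current_settings()` on the controller or a compute node returns 
a small dict you can compare against the latency you actually see 
between sites.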

-Paul Edmon-

On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
> I have a number of nodes that have, after our transition to CentOS 7.3/SLURM 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after going back online.
>
> Has anyone on this list seen similar behavior? I have increased logging to debug/verbose, but have seen no errors worth noting.
>
> Cheers,
>
> Alden
>