[slurm-users] Intermittent "Not responding" status
Paul Edmon
pedmon at cfa.harvard.edu
Mon Dec 4 12:04:21 MST 2017
I've seen this happen when there are internode communication issues
that disrupt the tree that Slurm uses to talk to the nodes and do
heartbeats. We have this happen occasionally in our environment, as we
have nodes in two geographically separate facilities and the
latency is substantial, so the lag crossing back and forth can add up. I
would check whether all your nodes can talk to each other and to the
master, and whether your Timeouts are set high enough.
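
As a rough illustration of those two checks, here is a minimal Python
sketch. It rests on assumptions that are not stated above: that the
timeouts in question are the slurm.conf parameters SlurmdTimeout and
MessageTimeout (with TreeWidth controlling the communication tree
fanout), and that slurmd listens on its default port, 6818. The helper
names parse_settings and tcp_reachable are hypothetical, made up for
this example.

```python
import socket

# Assumed parameter names: the message only says "Timeouts".
TIMEOUT_KEYS = ("SlurmdTimeout", "MessageTimeout", "TreeWidth")


def parse_settings(conf_text, keys=TIMEOUT_KEYS):
    """Return {key: int} for the requested keys found in slurm.conf-style text."""
    wanted = {k.lower(): k for k in keys}  # slurm.conf keys are case-insensitive
    found = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        key = wanted.get(name.strip().lower())
        if key is not None:
            found[key] = int(value.strip())
    return found


def tcp_reachable(host, port=6818, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    6818 is assumed to be the slurmd port (the default SlurmdPort).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical config path and node names; adjust for your site.
    with open("/etc/slurm/slurm.conf") as fh:
        print(parse_settings(fh.read()))
    for node in ("node001", "node002"):
        print(node, "reachable" if tcp_reachable(node) else "NOT reachable")
```

Run from the controller, this only tests the controller-to-node path.
Because messages are forwarded through the tree, the same check would
also need to be run from compute nodes toward each other to cover the
node-to-node case.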
-Paul Edmon-
On 12/04/2017 01:57 PM, Stradling, Alden Reid (ars9ac) wrote:
> I have a number of nodes that have, after our transition to Centos 7.3/SLURM 17.02, begun to occasionally display a status of "Not responding". The health check we run on each node every 5 minutes detects nothing, and the nodes are perfectly healthy once I set their state to "idle". The slurmd continues uninterrupted, and the nodes get jobs immediately after going back online.
>
> Has anyone on this list seen similar behavior? I have increased logging to debug/verbose, but have seen no errors worth noting.
>
> Cheers,
>
> Alden
>