[slurm-users] Slurm Node Unresponsive
Doug Meyer
dameyer99 at gmail.com
Wed Sep 9 01:46:43 UTC 2020
Hi,
Does scontrol ping, run from the node, show the Slurm controller as up? If
so, munge is fine. I doubt this is the problem, but it is such an easy check.
Ensure you have the same slurm.conf on master and client.
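One way to check this without eyeballing the two files is to compare
checksums. The following is only a minimal sketch: the function names are
made up for illustration, and it assumes you have already copied the
controller's slurm.conf onto the node under some temporary name.

```python
import hashlib
from pathlib import Path


def config_digest(path):
    """Return the SHA-256 hex digest of the file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def configs_match(local_path, reference_path):
    """True only when both files are byte-for-byte identical."""
    return config_digest(local_path) == config_digest(reference_path)
```

For example, configs_match("/etc/slurm/slurm.conf", "/tmp/slurm.conf.master")
would compare the node's copy against the controller's copy. Both paths are
assumptions; the real location depends on how Slurm was installed.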
The fact that restarting slurmd fixes everything is really odd. It
suggests slurmd is coming up too early. It is possible your system clock is
not being set to network time before slurmd starts. I have seen this
cause issues in the past.
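If the node runs systemd, a rough way to see whether the clock is currently
synchronized is to look at the NTPSynchronized property printed by
"timedatectl show". The sketch below assumes a systemd version that has that
subcommand; the function names are hypothetical. It only reports the state at
the moment it runs, so it would need to run early in boot (before slurmd) to
say anything about startup ordering.

```python
import subprocess


def ntp_synchronized(show_output):
    """Parse `timedatectl show` text; True only if NTPSynchronized=yes."""
    for line in show_output.splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "NTPSynchronized":
            return value.strip() == "yes"
    # Property not present: treat the clock as not synchronized.
    return False


def clock_is_synced():
    """Run timedatectl on this host (assumes systemd is in use)."""
    result = subprocess.run(
        ["timedatectl", "show"], capture_output=True, text=True, check=True
    )
    return ntp_synchronized(result.stdout)
```

The parsing is kept separate from the subprocess call so it can be tried on
saved output from the problem node.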
Doug
On Tue, Sep 8, 2020 at 7:13 AM Grant Campbell <grant.campbell at mythic-ai.com>
wrote:
> Hey,
>
> I am running a Slurm cluster that I inherited from an employee who left,
> so you will have to forgive any ignorance on my part; I am still coming up
> to speed on some core concepts.
>
> I have a vexing issue where one Slurm node consistently becomes
> unresponsive. Network and DNS seem to be working fine, but the control node
> logs "Nodes node3 not responding, setting DOWN". If I mark the node as
> RESUME it comes back up, but no jobs can be scheduled; I have to restart
> the slurmd process to get it to work.
>
> I enabled debug logging on the troublesome node, and I see it logging
> errors like the below near constantly:
>
> [2020-09-08T09:02:35.189] [59921.0] error: Unable to establish controller machine
> [2020-09-08T09:02:40.584] [59924.0] error: Unable to establish controller machine
> [2020-09-08T09:03:02.550] [59923.extern] error: Unable to establish controller machine
> [2020-09-08T09:03:04.537] [59921.extern] error: Unable to establish controller machine
> [2020-09-08T09:03:09.474] [59924.extern] error: Unable to establish controller machine
>
> This of course seems problematic, though I should note that these errors
> do not correlate chronologically with the outages at all; as I said, they
> are logged near constantly.
>
> One final piece of context: this machine OOM'd last week, and this issue
> began after we brought it back up. As part of that process, I had to
> re-join it to IPA, so I am not sure if there is something there that could
> have caused this issue.
>
> Any help or advice would be much appreciated, thanks!
>
> Thanks!
>
> -Grant
>