[slurm-users] Node/job failures following scontrol reconfigure command
Baker D.J.
D.J.Baker at soton.ac.uk
Thu Oct 4 05:19:27 MDT 2018
Hello,
We have just finished an upgrade to Slurm 18.08. My last task was to reset the slurmctld/slurmd timeouts to sensible values, i.e. the values they were set to prior to the upgrade. That is:
SlurmctldTimeout = 60 sec
SlurmdTimeout = 300 sec
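For context, these are set in slurm.conf and the running daemons are then told to re-read the configuration, roughly as follows (values as above; the exact location of slurm.conf may differ on your installation):

    # slurm.conf on the slurmctld host (and synced to the compute nodes)
    SlurmctldTimeout=60
    SlurmdTimeout=300

    # then, as root or the slurm user, push the new config to the running daemons
    scontrol reconfigure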
With Slurm versions before 18.08 I had reconfigured the cluster many times without any issues. Yesterday I found that this command "pushed" most of the compute nodes into a NODE_FAIL state, resulting in the loss of most running jobs.
I'm wondering if anyone has seen anything like this on their cluster, and if so, what the solution was. I would be interested in hearing your experiences, please. Maybe I need to revise/increase the timeout values; this sort of issue is tricky to test on an active cluster.
Best regards,
David