[slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

Paul Edmon pedmon at cfa.harvard.edu
Fri Aug 31 19:18:46 MDT 2018


So there are different options you can set for ReturnToService in 
slurm.conf which affect how a node is handled when it reconnects.  You 
can also raise the timeouts for the daemons.
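
For example, a minimal slurm.conf sketch along these lines (the values
are just illustrative starting points, not recommendations; tune them
for your site):

    # Let a node that was marked DOWN for being non-responsive return to
    # service automatically when slurmd re-registers (the default, 0,
    # keeps it DOWN until an admin resumes it).
    ReturnToService=1
    # Seconds slurmctld waits for slurmd to respond before setting the
    # node DOWN (default 300); a larger value rides out longer blips.
    SlurmdTimeout=600
    # Seconds allowed for a round-trip RPC to complete (default 10).
    MessageTimeout=30

After updating slurm.conf everywhere, "scontrol reconfigure" (or a
daemon restart) should pick up the new values.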

-Paul Edmon-


On 8/31/2018 5:06 PM, Renfro, Michael wrote:
> Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing, if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in the last year, I’ve had a failure inside the stacked Ethernet switches that’s caused Slurm to lose track of node and job state. Jobs kept running as normal, since all file traffic is on the Infiniband network.
>
> In both cases, I wasn’t able to cleanly recover. On the first outage, my attempt at recovery (pretty sure I forcibly drained and resumed the nodes) caused all active jobs to be killed, and then the next group of queued jobs to start. On the second outage, all active jobs were restarted from scratch, including truncating and overwriting any existing output. I think that involved my restarting slurmd or slurmctld services, but I’m not certain.
>
> I’ve built a VM test environment with OpenHPC and Slurm 17.11 to simulate these kinds of failures, but haven’t been able to reproduce my earlier results. After a sufficiently long network outage, I get downed nodes with "Reason=Duplicate jobid".
>
> Basically, I’d like to know the proper procedure for recovering from this kind of outage in the Slurm control network without losing the output of running jobs. I’m not sure I can easily add redundancy to the Ethernet network, but I may be able to use the Infiniband network for control traffic if that’s supported. Thanks.
>



