[slurm-users] Recovering from network failures in Slurm (without killing or restarting active jobs)

Renfro, Michael Renfro at tntech.edu
Fri Aug 31 15:06:14 MDT 2018


Hey, folks. I’ve got a Slurm 17.02 cluster (RPMs provided by Bright Computing, if it matters) with both gigabit Ethernet and Infiniband interfaces. Twice in the last year, a failure inside the stacked Ethernet switches has caused Slurm to lose track of node and job state. Jobs kept running normally, since all file traffic is on the Infiniband network.

In both cases, I wasn’t able to recover cleanly. On the first outage, my recovery attempt (I’m pretty sure I forcibly drained and resumed the nodes) killed all active jobs and then started the next group of queued jobs. On the second outage, all active jobs were restarted from scratch, truncating and overwriting any existing output; I think that involved restarting the slurmd or slurmctld services, but I’m not certain.
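For reference, the sequence I believe I used on the first outage was something like the following (node list hypothetical, and this is a reconstruction from memory, not a transcript):

    # Force the unreachable nodes down, then bring them back:
    scontrol update NodeName=node[001-040] State=DOWN Reason="ethernet outage"
    scontrol update NodeName=node[001-040] State=RESUME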

I’ve built a VM test environment with OpenHPC and Slurm 17.11 to simulate these kinds of failures, but I haven’t been able to reproduce my earlier results. After a sufficiently long network outage, I just get downed nodes with "Reason=Duplicate jobid".
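If it’s useful, I’m inspecting the downed nodes in the test cluster with the usual commands (the node name c1 is just what my OpenHPC VMs use):

    # List downed/drained nodes and the reason slurmctld recorded:
    sinfo -R
    # Full state for a single node, including the Reason= field:
    scontrol show node c1

If I understand correctly, "sufficiently long" here is governed by SlurmdTimeout in slurm.conf (300 seconds by default, I believe), after which slurmctld marks unresponsive nodes down.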

Basically, I’d like to know the proper procedure for recovering from this kind of outage in the Slurm control network without losing output from running jobs. I’m not sure I can easily add redundancy to the Ethernet network, but I might be able to use the Infiniband network for Slurm control traffic if that’s supported.
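In case it helps frame an answer, my naive guess at a safe sequence, assuming the daemons were never restarted and the network is back, would be something like this, though I don’t know whether it’s right:

    # Leave slurmctld and slurmd running; just clear the DOWN state so the
    # nodes re-register without touching running jobs (node list hypothetical):
    scontrol update NodeName=node[001-040] State=RESUME

Thanks.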

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University


