[slurm-users] Suspending jobs for file system maintenance

Fri Oct 22 22:07:37 UTC 2021

Thanks, Paul, for confirming our planned approach. We did it that way
and it worked very well. I have to admit that my palms were a bit
sweaty when suspending thousands of running jobs, but it worked without
any problems. I did not dare to resume all suspended jobs at once,
though, and resumed them in a staggered manner instead.
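
For the archives, a staggered resume could look roughly like the sketch
below. It is only an illustration: the function name resume_staggered,
the batch size, and the pause length are assumptions, not values taken
from this thread. It reuses the squeue/scontrol calls from the original
plan and sleeps after every batch of resumed jobs.

```shell
# Sketch: resume suspended jobs in batches instead of all at once.
# resume_staggered, the default batch size (100) and the default pause
# (30 seconds) are hypothetical and should be tuned per site.
resume_staggered() {
    batch_size="${1:-100}"
    pause_seconds="${2:-30}"
    count=0
    # List the IDs of all suspended jobs, one per line, without a header.
    squeue -ho %A -t S | while read -r jobid; do
        scontrol resume "$jobid"
        count=$((count + 1))
        # Pause after each full batch so load ramps up gradually.
        if [ $((count % batch_size)) -eq 0 ]; then
            sleep "$pause_seconds"
        fi
    done
}
```

Calling, for example, `resume_staggered 200 60` would resume 200 jobs,
wait one minute, and continue until no suspended jobs are left.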

Best regards
Jürgen

* Paul Edmon <pedmon at cfa.harvard.edu> [211019 15:15]:
> Yup, we follow the same process when we do Slurm upgrades; this looks
> analogous to our process.
> 
> -Paul Edmon-
> 
> On 10/19/2021 3:06 PM, Juergen Salk wrote:
> > Dear all,
> > 
> > we are planning to perform some maintenance work on our Lustre file system
> > which may or may not harm running jobs. Although failover functionality is
> > enabled on the Lustre servers, we'd like to minimize the risk to running jobs
> > in case something goes wrong.
> > 
> > Therefore, we thought about suspending all running jobs and resuming
> > them as soon as the file systems are back again.
> > 
> > The idea would be to stop Slurm from scheduling new jobs as a first step:
> > 
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
> > 
> > with foo, bar and baz being the configured partitions.
> > 
> > Then suspend all running jobs (taking job arrays into account):
> > 
> > # squeue -ho %A -t R | xargs -n 1 scontrol suspend
> > 
> > Then perform the failover of OSTs to another OSS server.
> > Once done, verify that the file system is fully back and all
> > OSTs are in place again on the client nodes.
> > 
> > Then resume all suspended jobs:
> > 
> > # squeue -ho %A -t S | xargs -n 1 scontrol resume
> > 
> > Finally bring back the partitions:
> > 
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
> > 
> > Does that make sense? Is that common practice? Are there any caveats that
> > we must think about?
> > 
> > Thank you in advance for your thoughts.
> > 
> > Best regards
> > Jürgen
> >