[slurm-users] Suspending jobs for file system maintenance

Alan Orth alan.orth at gmail.com
Mon Oct 25 08:47:48 UTC 2021


Dear Jürgen and Paul,

This is an interesting strategy, thanks for sharing. If I read the
scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job
processes, so they remain in memory but are paused. What happens to their
open file handles while the underlying filesystem goes away and comes back?
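For what it's worth, I suppose one could look at a suspended job on one of
its compute nodes to see the stopped processes and their handles -- the job
ID and PID below are just placeholders:

  # scontrol listpids <jobid>        # PIDs belonging to the job on this node
  # grep State: /proc/<pid>/status   # a SIGSTOP'd process shows "T (stopped)"
  # ls -l /proc/<pid>/fd             # open file handles, Lustre files included

Though that only shows the handles are still held, of course, not what the
Lustre client does with them while the OSTs fail over.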

Thank you,

On Sat, Oct 23, 2021 at 1:10 AM Juergen Salk <juergen.salk at uni-ulm.de>
wrote:

> Thanks, Paul, for confirming our planned approach. We did it that way
> and it worked very well. I have to admit that my palms were a bit
> sweaty when suspending thousands of running jobs, but it worked without
> any problems. I just didn't dare to resume all suspended jobs at
> once, but did so in a staggered manner.
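>
> One way to do such a staggered resume would be something along these
> lines -- the batch size and pause here are arbitrary:
>
> # squeue -ho %A -t S | xargs -n 100 | tr ' ' ',' | \
>     while read batch; do scontrol resume $batch; sleep 30; done
>
> (scontrol resume accepts a comma separated list of job IDs, so each
> batch goes through in a single scontrol call.)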
>
> Best regards
> Jürgen
>
> * Paul Edmon <pedmon at cfa.harvard.edu> [211019 15:15]:
> > Yup, we follow the same process when we do Slurm upgrades; this looks
> > analogous to ours.
> >
> > -Paul Edmon-
> >
> > On 10/19/2021 3:06 PM, Juergen Salk wrote:
> > > Dear all,
> > >
> > > we are planning to perform some maintenance work on our Lustre file
> > > system, which may or may not harm running jobs. Although failover
> > > functionality is enabled on the Lustre servers, we'd like to minimize
> > > the risk to running jobs in case something goes wrong.
> > >
> > > Therefore, we thought about suspending all running jobs and resuming
> > > them as soon as the file systems are back again.
> > >
> > > The idea would be to stop Slurm from scheduling new jobs as a first step:
> > >
> > > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
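> > >
> > > To double-check, something like this should show all partitions as
> > > down afterwards (%a prints the partition's availability):
> > >
> > > # sinfo -o "%P %a"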
> > >
> > > with foo, bar and baz being the configured partitions.
> > >
> > > Then suspend all running jobs (taking job arrays into account):
> > >
> > > # squeue -ho %A -t R | xargs -n 1 scontrol suspend
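> > >
> > > A quick sanity check afterwards could be to count jobs per state,
> > > i.e. running jobs should drop to zero and suspended jobs should match
> > > the previous running-job count:
> > >
> > > # squeue -h -t R | wc -l
> > > # squeue -h -t S | wc -l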
> > >
> > > Then perform the failover of the OSTs to another OSS server.
> > > Once done, verify that the file system is fully back and all
> > > OSTs are in place again on the client nodes.
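> > >
> > > On the client side, something like the following should list every
> > > OST again once the failover is complete (the mount point is just a
> > > placeholder):
> > >
> > > # lfs df /lustre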
> > >
> > > Then resume all suspended jobs:
> > >
> > > # squeue -ho %A -t S | xargs -n 1 scontrol resume
> > >
> > > Finally bring back the partitions:
> > >
> > > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
> > >
> > > Does that make sense? Is that common practice? Are there any caveats
> > > that we must think about?
> > >
> > > Thank you in advance for your thoughts.
> > >
> > > Best regards
> > > Jürgen
> > >
>
>

-- 
Alan Orth
alan.orth at gmail.com
https://picturingjordan.com
https://englishbulgaria.net
https://mjanja.ch