<div dir="ltr"><div>Dear Jürgen and Paul,</div><div><br></div><div>This is an interesting strategy; thanks for sharing. So if I read the scontrol man page correctly, `scontrol suspend` sends a SIGSTOP to all job processes. The processes remain in memory but are paused. What happens to open file handles when the underlying filesystem goes away and comes back?<br></div><div><br></div><div>Thank you,<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Oct 23, 2021 at 1:10 AM Juergen Salk &lt;<a href="mailto:juergen.salk@uni-ulm.de">juergen.salk@uni-ulm.de</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Thanks, Paul, for confirming our planned approach. We did it that way<br>
and it worked very well. I have to admit that my palms were a bit<br>
sweaty when suspending thousands of running jobs, but it worked without<br>
any problems. I just didn't dare resume all suspended jobs at<br>
once, so I resumed them in a staggered manner.<br>
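<br>
A staggered resume along those lines could be sketched as follows. This is only an illustration, not the procedure that was actually run: the batch size, the pause length, and the function names are assumptions, and only the squeue and scontrol calls are taken from this thread.<br>
<pre>
```shell
#!/usr/bin/env bash
# Sketch: resume suspended jobs in batches instead of all at once.
# BATCH_SIZE and PAUSE_SECONDS are illustrative assumptions,
# not values reported in this thread.
BATCH_SIZE="${BATCH_SIZE:-100}"
PAUSE_SECONDS="${PAUSE_SECONDS:-30}"

# Print the IDs of all suspended jobs, one per line
# (same squeue invocation as in the original procedure).
list_suspended_jobs() {
    squeue -ho %A -t S
}

# Read job IDs from stdin and resume them one by one,
# sleeping after every BATCH_SIZE resumed jobs.
resume_staggered() {
    local count=0 jobid
    while read -r jobid; do
        # Skip empty lines in the input.
        if [ -z "$jobid" ]; then continue; fi
        scontrol resume "$jobid"
        count=$((count + 1))
        if [ $((count % BATCH_SIZE)) -eq 0 ]; then
            sleep "$PAUSE_SECONDS"
        fi
    done
}

# Usage: list_suspended_jobs | resume_staggered
```
</pre>
Smaller batches and longer pauses spread out the I/O burst that resumed jobs may generate against the file system, at the cost of a longer overall resume phase.<br>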
<br>
Best regards<br>
Jürgen<br>
<br>
* Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu" target="_blank">pedmon@cfa.harvard.edu</a>&gt; [211019 15:15]:<br>
> Yup, we follow the same process when we do Slurm upgrades; this looks<br>
> analogous to our process.<br>
> <br>
> -Paul Edmon-<br>
> <br>
> On 10/19/2021 3:06 PM, Juergen Salk wrote:<br>
> > Dear all,<br>
> > <br>
> > we are planning to perform some maintenance work on our Lustre file system<br>
> > which may or may not harm running jobs. Although failover functionality is<br>
> > enabled on the Lustre servers we'd like to minimize risk for running jobs<br>
> > in case something goes wrong.<br>
> > <br>
> > Therefore, we thought about suspending all running jobs and resuming<br>
> > them as soon as the file systems are back again.<br>
> > <br>
> > The idea would be to stop Slurm from scheduling new jobs as a first step:<br>
> > <br>
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done<br>
> > <br>
> > with foo, bar and baz being the configured partitions.<br>
> > <br>
> > Then suspend all running jobs (taking job arrays into account):<br>
> > <br>
> > # squeue -ho %A -t R | xargs -n 1 scontrol suspend<br>
> > <br>
> > Then perform the failover of OSTs to another OSS server.<br>
> > Once done, verify that the file system is fully back and all<br>
> > OSTs are in place again on the client nodes.<br>
> > <br>
> > Then resume all suspended jobs:<br>
> > <br>
> > # squeue -ho %A -t S | xargs -n 1 scontrol resume<br>
> > <br>
> > Finally bring back the partitions:<br>
> > <br>
> > # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done<br>
> > <br>
> > Does that make sense? Is that common practice? Are there any caveats that<br>
> > we must think about?<br>
> > <br>
> > Thank you in advance for your thoughts.<br>
> > <br>
> > Best regards<br>
> > Jürgen<br>
> > <br>
<br>
</blockquote></div><br clear="all"><br>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div>Alan Orth<br><a href="mailto:alan.orth@gmail.com" target="_blank">alan.orth@gmail.com</a><br><a href="https://picturingjordan.com" target="_blank">https://picturingjordan.com</a><br><a href="https://englishbulgaria.net" target="_blank">https://englishbulgaria.net</a><br><a href="https://mjanja.ch" target="_blank">https://mjanja.ch</a></div></div></div>