[slurm-users] Suspending jobs for file system maintenance

Paul Edmon pedmon at cfa.harvard.edu
Tue Oct 19 19:15:28 UTC 2021


Yup, we follow the same process when we do Slurm upgrades; this looks 
analogous to ours.

-Paul Edmon-

On 10/19/2021 3:06 PM, Juergen Salk wrote:
> Dear all,
>
> we are planning to perform some maintenance work on our Lustre file system
> which may or may not harm running jobs. Although failover functionality is
> enabled on the Lustre servers, we'd like to minimize the risk to running jobs
> in case something goes wrong.
>
> Therefore, we thought about suspending all running jobs and resuming
> them as soon as the file system is back again.
>
> The idea would be to stop Slurm from scheduling new jobs as a first step:
>
> # for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
>
> with foo, bar and baz being the configured partitions.
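>
> If foo, bar and baz are indeed all configured partitions, the list could
> also be taken from Slurm itself rather than hardcoded. A minimal sketch
> (sinfo's %R format prints one partition name per line):
>
> # for p in $(sinfo -ho %R); do scontrol update PartitionName=$p State=DOWN; done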
>
> Then suspend all running jobs (taking job arrays into account):
>
> # squeue -ho %A -t R | xargs -n 1 scontrol suspend
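>
> Jobs may still have been launched between those two steps, so it is worth
> double-checking that nothing is left in the running state before starting
> the maintenance. A quick check (this should print nothing once all jobs
> are suspended):
>
> # squeue -ho %A -t R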
>
> Then perform the failover of OSTs to another OSS server.
> Once done, verify that the file system is fully back and all
> OSTs are in place again on the client nodes.
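>
> A possible client-side check (a sketch; /lustre stands in for the actual
> mount point): lfs df lists all OSTs as seen by that client, so a missing
> or inactive OST would show up here:
>
> # lfs df /lustre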
>
> Then resume all suspended jobs:
>
> # squeue -ho %A -t S | xargs -n 1 scontrol resume
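>
> Again, before reopening the partitions it may be worth confirming that no
> jobs were left behind in the suspended state (this should print nothing):
>
> # squeue -ho %A -t S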
>
> Finally bring back the partitions:
>
> # for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
>
> Does that make sense? Is that common practice? Are there any caveats
> we should be aware of?
>
> Thank you in advance for your thoughts.
>
> Best regards
> Jürgen


