[slurm-users] Suspending jobs for file system maintenance
juergen.salk at uni-ulm.de
Tue Oct 19 19:06:14 UTC 2021
We are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although failover functionality is
enabled on the Lustre servers, we would like to minimize the risk to running
jobs in case something goes wrong.
Therefore, we thought about suspending all running jobs and resuming
them as soon as the file system is back again.
The idea would be to stop Slurm from scheduling new jobs as a first step:
# for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
with foo, bar and baz being the configured partitions.
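Rather than hard-coding the partition names, the list could also be derived
from sinfo. A minimal sketch (the SINFO/SCONTROL variables are only there so
the commands can be overridden, e.g. with echo for a dry run):

```shell
# Derive the partition list from sinfo instead of hard-coding it.
# 'sinfo -h -o %P' prints one partition per line; the default
# partition carries a trailing '*', which is stripped here.
SINFO=${SINFO:-sinfo}
SCONTROL=${SCONTROL:-scontrol}

set_all_partitions() {
    state=$1   # DOWN or UP
    for p in $($SINFO -h -o %P | tr -d '*' | sort -u); do
        $SCONTROL update PartitionName="$p" State="$state"
    done
}
```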
Then suspend all running jobs (taking job arrays into account):
# squeue -ho %A -t R | xargs -n 1 scontrol suspend
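With a large number of running jobs, one scontrol invocation per job may be
slow; scontrol suspend also accepts a comma-separated job list, so the IDs
can be batched. A sketch (the command variables again only allow a dry run):

```shell
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}

suspend_all_running() {
    # %A prints the (array master) job ID, so whole job arrays are
    # handled with a single ID each; batch 100 IDs per scontrol call.
    $SQUEUE -ho %A -t R | sort -u | xargs -r -n 100 \
        | tr ' ' ',' | xargs -r -n 1 $SCONTROL suspend
}
```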
Then perform the failover of OSTs to another OSS server.
Once done, verify that the file system is fully back and all
OSTs are in place again on the client nodes.
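The client-side check could be scripted roughly like this. A sketch only:
lfs df does list the file system's targets, but the mount point and the
expected OST count below are assumptions for your site:

```shell
LFS=${LFS:-lfs}
MOUNTPOINT=${MOUNTPOINT:-/lustre}     # assumption: adjust to your mount
EXPECTED_OSTS=${EXPECTED_OSTS:-8}     # assumption: your real OST count

all_osts_present() {
    # 'lfs df' prints one line per MDT/OST target; count the OST
    # lines and compare against the number the file system should have.
    n=$($LFS df "$MOUNTPOINT" 2>/dev/null | grep -c 'OST')
    [ "$n" -eq "$EXPECTED_OSTS" ]
}
```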
Then resume all suspended jobs:
# squeue -ho %A -t S | xargs -n 1 scontrol resume
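One caveat with resuming via -t S: it would also resume jobs that were
already suspended before the maintenance for other reasons (e.g. by a user
or by preemption). A sketch that records exactly the jobs it suspended and
later resumes only those (the state file path is an assumption):

```shell
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}
STATEFILE=${STATEFILE:-/root/maint-suspended-jobs.txt}  # assumed path

suspend_and_record() {
    # Remember which jobs *we* suspended, so that jobs suspended for
    # other reasons are not resumed by accident afterwards.
    $SQUEUE -ho %A -t R | sort -u | tee "$STATEFILE" \
        | xargs -r -n 1 $SCONTROL suspend
}

resume_recorded() {
    xargs -r -n 1 $SCONTROL resume < "$STATEFILE"
}
```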
Finally bring back the partitions:
# for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
Does that make sense? Is that common practice? Are there any caveats
we should be aware of?
Thank you in advance for your thoughts.