[slurm-users] Suspending jobs for file system maintenance
juergen.salk at uni-ulm.de
Tue Oct 19 19:06:14 UTC 2021
We are planning to perform some maintenance work on our Lustre file system
which may or may not harm running jobs. Although failover functionality is
enabled on the Lustre servers, we would like to minimize the risk to running
jobs in case something goes wrong.
Therefore, we thought about suspending all running jobs and resuming
them as soon as the file system is back again.
The idea would be to stop Slurm from scheduling new jobs as a first step:
# for p in foo bar baz; do scontrol update PartitionName=$p State=DOWN; done
with foo, bar and baz being the configured partitions.
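Rather than hard-coding the partition names, the list could also be derived
from sinfo. A minimal sketch (the SINFO/SCONTROL variables are only there so
the commands can be overridden, e.g. with echo for a dry run):

```shell
# Derive the partition list from sinfo instead of hard-coding it.
# 'sinfo -h -o %P' prints one partition per line; the default
# partition carries a trailing '*', which is stripped here.
SINFO=${SINFO:-sinfo}
SCONTROL=${SCONTROL:-scontrol}

set_all_partitions() {
    state=$1   # DOWN or UP
    for p in $($SINFO -h -o %P | tr -d '*' | sort -u); do
        $SCONTROL update PartitionName="$p" State="$state"
    done
}
```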
Then suspend all running jobs (taking job arrays into account):
# squeue -ho %A -t R | xargs -n 1 scontrol suspend
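With a large number of running jobs, one scontrol invocation per job may be
slow; scontrol suspend also accepts a comma-separated job list, so the IDs
can be batched. A sketch (the command variables again only allow a dry run):

```shell
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}

suspend_all_running() {
    # %A prints the (array master) job ID, so whole job arrays are
    # handled with a single ID each; batch 100 IDs per scontrol call.
    $SQUEUE -ho %A -t R | sort -u | xargs -r -n 100 \
        | tr ' ' ',' | xargs -r -n 1 $SCONTROL suspend
}
```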
Then perform the failover of OSTs to another OSS server.
Once done, verify that the file system is fully back and all
OSTs are in place again on the client nodes.
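The client-side check could be scripted roughly like this. A sketch only:
lfs df does list the file system's targets, but the mount point and the
expected OST count below are assumptions for your site:

```shell
LFS=${LFS:-lfs}
MOUNTPOINT=${MOUNTPOINT:-/lustre}     # assumption: adjust to your mount
EXPECTED_OSTS=${EXPECTED_OSTS:-8}     # assumption: your real OST count

all_osts_present() {
    # 'lfs df' prints one line per MDT/OST target; count the OST
    # lines and compare against the number the file system should have.
    n=$($LFS df "$MOUNTPOINT" 2>/dev/null | grep -c 'OST')
    [ "$n" -eq "$EXPECTED_OSTS" ]
}
```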
Then resume all suspended jobs:
# squeue -ho %A -t S | xargs -n 1 scontrol resume
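One caveat with resuming via -t S: it would also resume jobs that were
already suspended before the maintenance for other reasons (e.g. by a user
or by preemption). A sketch that records exactly the jobs it suspended and
later resumes only those (the state file path is an assumption):

```shell
SQUEUE=${SQUEUE:-squeue}
SCONTROL=${SCONTROL:-scontrol}
STATEFILE=${STATEFILE:-/root/maint-suspended-jobs.txt}  # assumed path

suspend_and_record() {
    # Remember which jobs *we* suspended, so that jobs suspended for
    # other reasons are not resumed by accident afterwards.
    $SQUEUE -ho %A -t R | sort -u | tee "$STATEFILE" \
        | xargs -r -n 1 $SCONTROL suspend
}

resume_recorded() {
    xargs -r -n 1 $SCONTROL resume < "$STATEFILE"
}
```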
Finally bring back the partitions:
# for p in foo bar baz; do scontrol update PartitionName=$p State=UP; done
Does that make sense? Is that common practice? Are there any caveats
we should be aware of?
Thank you in advance for your thoughts.