Hi all,
I am looking for a clean way to set up Slurm's native high-availability feature. I manage a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I would like to set up as a backup control node.
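For reference, the relevant part of the slurm.conf I have in mind looks roughly like this (hostnames and paths are placeholders, not my actual setup):

    SlurmctldHost=ctl-primary     # current control node, runs slurmctld + slurmdbd
    SlurmctldHost=ctl-backup      # the VM I want to add as backup controller
    StateSaveLocation=/var/spool/slurm/statesave  # must be shared by both hosts
    SlurmctldTimeout=120          # seconds before the backup takes over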
The Slurm documentation says the following about the StateSaveLocation directory:
The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]
My question: How do I implement the shared file system for the StateSaveLocation?
I do not want to introduce a single point of failure by having a single node host the StateSaveLocation, nor do I want to put that directory on the cluster's NFS storage: outages/downtime of the storage system will happen at some point, and I do not want them to take down the Slurm controller as well.
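One idea I have been toying with is a small replicated GlusterFS volume spanning three nodes, mounted on both controllers. A rough sketch (hostnames, volume name, and brick paths are placeholders):

    # on one node, after probing the other peers:
    gluster peer probe node2
    gluster peer probe node3
    # create and start a 3-way replicated volume
    gluster volume create slurm_statesave replica 3 \
        node1:/data/glusterfs/statesave/brick \
        node2:/data/glusterfs/statesave/brick \
        node3:/data/glusterfs/statesave/brick
    gluster volume start slurm_statesave
    # on both controllers, mount it at the StateSaveLocation
    mount -t glusterfs node1:/slurm_statesave /var/spool/slurm/statesave

I am unsure, however, whether the latency of a FUSE mount like this is acceptable for the state save directory, so I would be happy to hear what others do in practice.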
Any help or ideas would be appreciated.
Best, Pierre
[1] https://slurm.schedmd.com/quickstart_admin.html#Config