Hi all,
I am looking for a clean way to set up Slurm's native high-availability feature. I manage a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I would like to set up as a backup control node.
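For reference, the relevant part of the slurm.conf I have in mind looks roughly like this (hostnames and paths are placeholders, not my actual setup):

    SlurmctldHost=ctl-primary     # current control node, runs slurmctld + slurmdbd
    SlurmctldHost=ctl-backup      # the VM I want to add as backup controller
    StateSaveLocation=/var/spool/slurm/statesave  # must be shared by both hosts
    SlurmctldTimeout=120          # seconds before the backup takes over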
The Slurm documentation says the following about the StateSaveLocation directory:
The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]
My question: How do I implement the shared file system for the StateSaveLocation?
I do not want to introduce a single point of failure by having a single node host the StateSaveLocation, nor do I want to put that directory on the cluster's NFS storage: outages/downtime of the storage system will happen at some point, and I do not want them to take down the Slurm controller as well.
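One idea I have been toying with is a small replicated GlusterFS volume spanning three nodes, mounted on both controllers. A rough sketch (hostnames, volume name, and brick paths are placeholders):

    # on one node, after probing the other peers:
    gluster peer probe node2
    gluster peer probe node3
    # create and start a 3-way replicated volume
    gluster volume create slurm_statesave replica 3 \
        node1:/data/glusterfs/statesave/brick \
        node2:/data/glusterfs/statesave/brick \
        node3:/data/glusterfs/statesave/brick
    gluster volume start slurm_statesave
    # on both controllers, mount it at the StateSaveLocation
    mount -t glusterfs node1:/slurm_statesave /var/spool/slurm/statesave

I am unsure, however, whether the latency of a FUSE mount like this is acceptable for the state save directory, so I would be happy to hear what others do in practice.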
Any help or ideas would be appreciated.
Best, Pierre
[1] https://slurm.schedmd.com/quickstart_admin.html#Config