Are you seeking something simple rather than sophisticated? If so, you could use the controller's local disk for StateSaveLocation and set up a cron job (on the same node or elsewhere) that periodically copies that data out via e.g. rsync to somewhere the backup control node can reach (NFS?). That obviously introduces a time delay, which may or may not be a problem depending on what kind of failures you are trying to protect against and what level of guarantee you want the HA setup to provide: you will not be protected in every possible scenario. On the other hand, given the size of the cluster it might be adequate, and it is essentially zero effort, so it may well be "good enough" for you.
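For illustration, a minimal sketch of what I mean, dropped into /etc/cron.d on the primary controller. The file name, the state directory, and the NFS destination are just placeholders; adjust them to your environment:

    # /etc/cron.d/slurm-statesave-copy  (hypothetical file name)
    # Every minute, copy the StateSaveLocation somewhere the backup
    # controller can reach. -a preserves permissions and timestamps,
    # --delete keeps the copy from accumulating stale state files.
    * * * * * root rsync -a --delete /var/spool/slurmctld/ /nfs/slurm-statesave-copy/

If you ever fail over, you would point the backup controller at (a copy of) that directory before starting slurmctld; anything that changed after the last rsync run would of course be lost.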
On Tue, May 7, 2024 at 4:44 AM Pierre Abele via slurm-users < slurm-users@lists.schedmd.com> wrote:
Hi all,
I am looking for a clean way to set up Slurm's native high availability feature. I am managing a Slurm cluster with one control node (hosting both slurmctld and slurmdbd), one login node, and a few dozen compute nodes. I have a virtual machine that I want to set up as a backup control node.
The Slurm documentation says the following about the StateSaveLocation directory:
The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]
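For reference, the relevant part of my planned slurm.conf would look roughly like the following (hostnames and the path are placeholders; the first SlurmctldHost entry is the primary, the second the backup):

    SlurmctldHost=ctl-primary                  # physical control node
    SlurmctldHost=ctl-backup                   # backup VM
    StateSaveLocation=/path/shared/by/both/controllers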
My question: How do I implement the shared file system for the StateSaveLocation?
I do not want to introduce a single point of failure by having a single node host the StateSaveLocation, nor do I want to put that directory on the cluster's NFS storage, since outages/downtime of the storage system will happen at some point and I do not want that to cause an outage of the Slurm controller.
Any help or ideas would be appreciated.
Best, Pierre
[1] https://slurm.schedmd.com/quickstart_admin.html#Config
-- Pierre Abele, M.Sc.
HPC Administrator Max-Planck-Institute for Evolutionary Anthropology Department of Primate Behavior and Evolution
Deutscher Platz 6 04103 Leipzig
Room: U2.80 E-Mail: pierre_abele@eva.mpg.de Phone: +49 (0) 341 3550 245
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com