Are you looking for something simple rather than sophisticated? If so, you can keep StateSaveLocation on the controller's local disk and set up a cron job (on the same node or elsewhere) that copies the data out via e.g. rsync to wherever the backup control node can reach it (NFS?). That obviously introduces a time delay, which may or may not be a problem depending on what kinds of failures you are trying to protect against and what level of guarantee you want from the HA setup: you will not be covered in every possible scenario. On the other hand, given the size of your cluster it might be adequate, and it is basically zero effort, so it may well be "good enough" for you.
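
Roughly something like this, as a minimal sketch (the StateSaveLocation path, the slurm user and the backup-ctl hostname are just placeholders for whatever your setup uses):

    # /etc/cron.d/slurm-statesave-sync on the primary controller
    # every minute, copy the state save directory to the backup host
    * * * * * slurm rsync -a --delete /var/spool/slurmctld/ backup-ctl:/var/spool/slurmctld/

Both controllers would then point at the same StateSaveLocation path in slurm.conf, with the backup VM listed as the second SlurmctldHost entry, so that when it takes over it picks up the most recently synced state.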

On Tue, May 7, 2024 at 4:44 AM Pierre Abele via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi all,

I am looking for a clean way to set up Slurm's native high availability
feature. I am managing a Slurm cluster with one control node (hosting
both slurmctld and slurmdbd), one login node and a few dozen compute
nodes. I have a virtual machine that I want to set up as a backup
control node.

The Slurm documentation says the following about the StateSaveLocation
directory:

> The directory used should be on a low-latency local disk to prevent file system delays from affecting Slurm performance. If using a backup host, the StateSaveLocation should reside on a file system shared by the two hosts. We do not recommend using NFS to make the directory accessible to both hosts, but do recommend a shared mount that is accessible to the two controllers and allows low-latency reads and writes to the disk. If a controller comes up without access to the state information, queued and running jobs will be cancelled. [1]

My question: How do I implement the shared file system for the
StateSaveLocation?

I do not want to introduce a single point of failure by having a single
node host the StateSaveLocation, nor do I want to put that directory on
the cluster's NFS storage, since outages/downtime of the storage system
will happen at some point and I do not want that to cause an outage of
the Slurm controller.

Any help or ideas would be appreciated.

Best,
Pierre


[1] https://slurm.schedmd.com/quickstart_admin.html#Config

--
Pierre Abele, M.Sc.

HPC Administrator
Max-Planck-Institute for Evolutionary Anthropology
Department of Primate Behavior and Evolution

Deutscher Platz 6
04103 Leipzig

Room: U2.80
E-Mail: pierre_abele@eva.mpg.de
Phone: +49 (0) 341 3550 245
