[slurm-users] In high availability scenario, what is the best way to synchronize state files with scontrol takeover command?
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Apr 19 10:03:30 UTC 2021
Hi wenxiaoll at 126.com,
I think it is safer to get some experience with Slurm first, *without*
a High Availability setup for the slurmctld server.
I highly recommend that you study the SchedMD presentations available on the
page https://slurm.schedmd.com/publications.html. In particular, the
paper from 2018:
* Technical: Field Notes Mark 2: Random Musings From Under A New Hat, Tim
Wickberg, SchedMD
The pages starting at page 26, "Cluster Architecture - Typical Linux Cluster",
discuss the Slurm High Availability setup.
Please note that for a High Availability slurmctld, you must configure the
StateSaveLocation directory on a separate High Availability storage system
that both slurmctld hosts can mount.
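A minimal slurm.conf sketch of such a setup follows. The host names ctl1 and
ctl2 and the mount point /shared/slurm are placeholders I made up for
illustration, and the timeout shown is simply the documented default; the
parameter is spelled StateSaveLocation in slurm.conf.

```
# slurm.conf fragment (host names and path are placeholders, not from this thread)
# The first SlurmctldHost entry is the primary; later entries are backups.
SlurmctldHost=ctl1
SlurmctldHost=ctl2
# Must live on shared storage that both ctl1 and ctl2 can read and write.
StateSaveLocation=/shared/slurm/state
# Seconds the backup waits for the primary to respond before taking over.
SlurmctldTimeout=120
```

With the state directory shared like this, the backup reads the same files
the primary wrote, so no copying between the two hosts should be needed,
whether the failover is triggered by the timeout or by "scontrol takeover".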
/Ole
On 4/19/21 10:39 AM, 刘文晓 wrote:
> I have a problem with Slurm's high availability.
> In my environment, I store the state files on the local hard disk of the
> Ctld nodes, and use a shell script that checks the output of "scontrol ping"
> to sync the files at a fixed interval (2s; making the interval shorter
> reduces the server throughput).
>
> When I test Slurm HA, I find that a failover triggered by the heartbeat
> method takes about the time configured in slurm.conf,
> but a failover with the command "scontrol takeover 1" takes only 2.5s to 3s.
>
> The shell script method works well in the first scenario (heartbeat).
> But in the second scenario (scontrol takeover), I found it is not a good
> way to synchronize the state files from the main Ctld to the new main Ctld.
>
> I have several questions:
> 1. What is your preferred way to handle state files in an HA setup? I did
> not find useful information on the Slurm website.
> 2. What is the best way to sync state files with a shell script? I went
> through the code for the parameters "SlurmctldPrimaryOffProg" and
> "SlurmctldPrimaryOnProg", and found that the OffProg is the better place to
> do the final sync operation. Is my idea OK for this scenario?