[slurm-users] Migrate the slurmdbd service to another server
loris.bennett at fu-berlin.de
Mon Mar 4 15:30:24 UTC 2019
Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:
> On 3/4/19 2:26 PM, Loris Bennett wrote:
>> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:
>>> We're one of the many Slurm sites which run the slurmdbd database daemon on the
>>> same server as the slurmctld daemon. This works without problems at our site
>>> given our modest load; however, SchedMD recommends running the daemons on
>>> separate servers.
>>> Contemplating how to upgrade our cluster from Slurm 17.11 to 18.08, I've come to
>>> appreciate the advantage of running the daemons on separate servers: One can
>>> upgrade slurmdbd to 18.08 while keeping slurmctld at 17.11 (for a while at
>>> least). This enables us to upgrade to 18.08 in the recommended order without
>>> any interruption to our running jobs and without any cluster downtime.
>> Can't one do this even with only one server? We have always run both
>> slurmctld and slurmdbd on one machine and have performed all the updates
>> without any downtime.
> For a minor upgrade from 17.11.x to 17.11.y there is no issue because the MySQL
> database layout is unchanged.
> Major upgrades such as 17.11 to 18.08 are potentially more risky, see for example
> this list thread "Extreme long db upgrade 16.05.6 -> 17.11.3":
> I recommend to study the instructions in
That is indeed the protocol we follow.
> See also the slides on "Upgrading" in
> https://slurm.schedmd.com/SLUG18/field_notes2.pdf from the SLUG meeting 2018
> Updating the database layout during a Slurm major upgrade can in special
> situations lead to problems, so it's safer to do the upgrade separately for
> slurmdbd and slurmctld. This is why I've decided to move my slurmdbd and
> database to a separate server now. The slurmctld which governs the entire
> cluster is thereby unaffected as I "play" with the database upgrade, and I can
> upgrade Slurm without any cluster downtime.
I don't understand how the separation of the two services onto two
machines in the production environment makes such a difference. No
matter where the slurmdbd is running, the slurmctld will attempt to
contact it and cache data if the slurmdbd is unreachable. Or is the
point more that, with a second machine you can do an offline conversion
of the database, i.e. it is good to have a test and a production
On the other hand, the Quick Start Admin Guide
(https://slurm.schedmd.com/quickstart_admin.html) does mention "head
node, compute nodes, and slurmdbd node". I had always assumed a
separate slurmdbd node was mainly useful for performance reasons at
sites with a huge throughput of jobs, but maybe I am missing something.
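
To make the upgrade order discussed above concrete (slurmdbd goes to 18.08
first, slurmctld stays at 17.11 for a while), here is a minimal Python sketch
of the version check involved. It is not part of Slurm: the helper names and
the release list are hypothetical. The rule that slurmctld may lag slurmdbd by
at most two major releases is my understanding of the upgrade documentation
and should be treated as an assumption.

```python
# Hypothetical helper, not part of Slurm: checks whether a slurmctld release
# can keep talking to a given slurmdbd release during a staged upgrade.

# Major releases in chronological order (illustrative list, assumed complete
# only for the versions mentioned in this thread).
RELEASES = ["16.05", "17.02", "17.11", "18.08"]


def major(version):
    """Return the major release of a version string, e.g. '17.11.3' -> '17.11'."""
    return ".".join(version.split(".")[:2])


def ctld_can_use_dbd(ctld_version, dbd_version, max_gap=2):
    """Return True if slurmdbd is at the same major release as slurmctld,
    or at most max_gap major releases newer (assumed compatibility rule)."""
    ctld_idx = RELEASES.index(major(ctld_version))
    dbd_idx = RELEASES.index(major(dbd_version))
    return 0 <= dbd_idx - ctld_idx <= max_gap


# slurmdbd upgraded first, slurmctld still on the old release
print(ctld_can_use_dbd("17.11.3", "18.08.0"))  # True
# slurmctld ahead of slurmdbd: the order that should be avoided
print(ctld_can_use_dbd("18.08.0", "17.11.3"))  # False
```

The check is the same whether the two daemons share a machine or not; it only
encodes the ordering, not where the services run.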
[snip (26 lines)]
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de