[slurm-users] Migrate the slurmdbd service to another server

Mon Mar 4 15:30:24 UTC 2019

Hi Ole,

Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:

> On 3/4/19 2:26 PM, Loris Bennett wrote:
>> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> writes:
>>> We're one of the many Slurm sites which run the slurmdbd database daemon on the
>>> same server as the slurmctld daemon.  This works without problems at our site
>>> given our modest load.  However, SchedMD recommends running the daemons on
>>> separate servers.
>>>
>>> Contemplating how to upgrade our cluster from Slurm 17.11 to 18.08, I've come to
>>> appreciate the advantage of running the daemons on separate servers: One can
>>> upgrade slurmdbd to 18.08 while keeping slurmctld at 17.11 (for a while at
>>> least).  This enables us to upgrade to 18.08 in the recommended order without
>>> any interruption to our running jobs and without any cluster downtime.
>>
>> Can't one do this even with only one server?  We have always run both
>> slurmctld and slurmdbd on one machine and have performed all the updates
>> without any downtime.
>
> For a minor upgrade from 17.11.x to 17.11.y there is no issue because the MySQL
> database layout is unchanged.
>
> Major upgrades such as 17.11 to 18.08 are potentially riskier; see for example
> this list thread "Extreme long db upgrade 16.05.6 -> 17.11.3":
> https://lists.schedmd.com/pipermail/slurm-users/2018-February/000612.html
>
> I recommend studying the instructions in
> https://slurm.schedmd.com/quickstart_admin.html#upgrade.

That is indeed the protocol we follow.

> See also the slides on "Upgrading" in
> https://slurm.schedmd.com/SLUG18/field_notes2.pdf from the SLUG meeting 2018
> (https://slurm.schedmd.com/publications.html).
>
> Updating the database layout during a Slurm major upgrade can in special
> situations lead to problems, so it's safer to do the upgrade separately for
> slurmdbd and slurmctld.  This is why I've decided to move my slurmdbd and
> database to a separate server now.  The slurmctld which governs the entire
> cluster is thereby unaffected as I "play" with the database upgrade, and I can
> upgrade Slurm without any cluster downtime.

I don't understand how the separation of the two services onto two
machines in the production environment makes such a difference.  No
matter where the slurmdbd is running, the slurmctld will attempt to
contact it and cache data if the slurmdbd is unreachable.  Or is the
point more that, with a second machine you can do an offline conversion
of the database, i.e. it is good to have a test and a production
environment?
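
To make the ordering constraint under discussion concrete, here is a
minimal Python sketch of my understanding of it: slurmdbd has to be at
the same or a newer major release than the slurmctld that talks to it.
The function names are hypothetical (they are not part of Slurm), and
the sketch deliberately ignores the limit on how many major releases
apart the two daemons may be.

```python
def parse_release(version):
    """Return the major release of a Slurm version string as (year, month).

    For example "18.08.5" gives (18, 8); the third field is the
    minor (maintenance) level and is ignored here.
    """
    year, month = version.split(".")[:2]
    return int(year), int(month)


def dbd_can_serve(slurmdbd_version, slurmctld_version):
    """Assumed rule: slurmdbd must not be older than slurmctld."""
    return parse_release(slurmdbd_version) >= parse_release(slurmctld_version)


# slurmdbd upgraded first, slurmctld still on the old release
print(dbd_can_serve("18.08.5", "17.11.12"))
# slurmctld upgraded before slurmdbd, which is the wrong order
print(dbd_can_serve("17.11.12", "18.08.5"))
```

Note that nothing in this check depends on which host each daemon runs
on, which is the point of my question.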

On the other hand, the Quick Start Admin Guide
(https://slurm.schedmd.com/quickstart_admin.html) does mention "head
node, compute nodes, and slurmdbd node".  I had always assumed a
separate slurmdbd node was mainly useful for performance reasons at
sites with a huge throughput of jobs, but maybe I am missing something.

[snip (26 lines)]

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de