[slurm-users] [EXTERNAL] SlurmDBD losing connection to the backend MariaDB
greg.wickham at kaust.edu.sa
Tue Nov 1 11:10:17 UTC 2022
While trying to respond I was looking into the manual pages, and while it does appear that Slurm can support some kind of high availability (*), it doesn’t seem simple.
With multiple slurmctld instances, only one can be active at any time, as they share state information. It’s not clear how they know about each other, so this may require STONITH (*).
With slurmdbd, there’s “AccountingStorageHost” and “AccountingStorageBackupHost”, but again it’s not quite clear how these interact.
In slurmdbd.conf there is “StorageBackupHost” with the description:
. . . . It is up to the backup solution to enforce the coherency of the
accounting information between the two hosts. With clustered
database solutions (active/passive HA), you would not need to use
this feature. Default is none.
At our site we’re running only a simple setup: one VM with slurmctld and another VM with both slurmdbd and mariadbd.
Perhaps others who have dabbled with redundancy can reply.
(* I say this trusting that the best way to get a response on the Internet is to say something wrong and then wait for the avalanche of corrections.)
On 01/11/2022, 12:08, "slurm-users" <slurm-users-bounces at lists.schedmd.com> wrote:
I have a two-node setup. node1 is the primary slurmctld + backup slurmdbd, and node2 is the primary slurmdbd + backup slurmctld and also the MySQL database host.
My concern is: if node2 goes down, the backup slurmdbd will take over, but then what will happen?
I have read that slurmctld can cache data, but what about slurmdbd? I'm not sure.
I intentionally put slurmdbd + MariaDB on the second node because I didn't want to overload the primary slurmctld.
I hope this gives you all a picture of how my setup looks.
On 11/1/2022 10:40 AM, Greg Wickham wrote:
Slurmctld caches the updates until slurmdbd comes back online.
You can see how many records are pending for the database by using the “sdiag” command and looking for “DBD Agent queue size”.
If this number grows significantly it means that slurmdbd isn’t available.
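As a rough illustration of the check described above, here is a minimal Python sketch that pulls the “DBD Agent queue size” value out of sdiag output. It assumes sdiag prints a line of the form “DBD Agent queue size: N”; the function names and the warning threshold are hypothetical, so pick a threshold that suits your site.

```python
import re
import subprocess
from typing import Optional

# Assumption: sdiag prints a line such as "DBD Agent queue size: 0".
_QUEUE_RE = re.compile(r"DBD Agent queue size:\s*(\d+)")

# Hypothetical threshold; there is no value recommended in this thread.
WARN_THRESHOLD = 1000


def parse_dbd_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the DBD Agent queue size found in sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def read_dbd_queue_size() -> Optional[int]:
    """Run sdiag (must be on PATH on a Slurm host) and parse its output."""
    result = subprocess.run(
        ["sdiag"], capture_output=True, text=True, check=True
    )
    return parse_dbd_queue_size(result.stdout)


def queue_is_backing_up(size: Optional[int],
                        threshold: int = WARN_THRESHOLD) -> bool:
    """True when a queue size was found and it exceeds the threshold."""
    return size is not None and size > threshold
```

Calling read_dbd_queue_size() periodically (from cron, for example) and passing the result to queue_is_backing_up() would flag the case where slurmctld is accumulating records because slurmdbd is unreachable.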
On 01/11/2022, 07:23, "slurm-users" <slurm-users-bounces at lists.schedmd.com> wrote:
Just for my info, I would like to know what happens when SlurmDBD loses
connection to the backend Database, for ex, MariaDB.
Does it cache the accounting info and keep it until the DB comes back
up, or does it panic and shut down?