[slurm-users] SlurmDBD losing connection to the backend MariaDB

Brian Andrus toomuchit at gmail.com
Wed Nov 2 02:39:03 UTC 2022


In that scenario, the backup slurmdbd would take over, but then its 
database would not necessarily be in sync with the 'main' database 
(hence the warnings/info about it in the documentation).

For my setup, I have 2 slurmdbd hosts, but they both connect to the 
same, separate, MariaDB server, which is HA. Now, I can take down the 
primary slurmdbd system and the other will take over, so I can bring them 
up/down as needed for updates, etc.
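
A layout like this could be sketched roughly as follows. This is only an 
illustration, not my actual configuration: the host names (dbd1, dbd2, 
mariadb-ha) and the database name and user are made-up placeholders, and 
the HA MariaDB service is assumed to be reachable under a single name or 
virtual IP.

```
# slurm.conf (on the slurmctld hosts); host names are hypothetical
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd1
AccountingStorageBackupHost=dbd2

# slurmdbd.conf (same file on dbd1 and dbd2)
DbdHost=dbd1
DbdBackupHost=dbd2
StorageType=accounting_storage/mysql
# Both slurmdbd daemons point at the same HA MariaDB service (assumed name)
StorageHost=mariadb-ha
StorageLoc=slurm_acct_db
StorageUser=slurm
```

Since both daemons write to the same database, there is nothing to keep 
in sync on the Slurm side when one slurmdbd takes over from the other.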

If your two slurmdbd servers use different databases, you would need a 
way to keep them in sync, regardless of which slurmdbd was processing 
data. There are many ways to do that, but those designs fall under 
MariaDB and not Slurm.

Brian Andrus

On 11/1/2022 6:49 PM, Richard Chang wrote:
> Does it mean it is best to use a single slurmdbd host in my case?
> My primary slurmctld is the backup slurmdbd host, and my worry is: if 
> the primary slurmdbd host (which is also the MariaDB server) goes 
> down, will the backup slurmdbd be able to cache data and wait until 
> MariaDB catches up?
> Thanks,
> RC
> On 11/2/2022 2:00 AM, Brian Andrus wrote:
>> Ole,
>> Fair enough, it is actually slurmctld that does the caching. 
>> Technical typo on my part there.
>> Just trying to let the user know, there is a window that they have to 
>> ensure no information is lost during a database outage.
>> Brian Andrus
>> On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
>>> Hi Brian,
>>> On 11/1/22 05:28, Brian Andrus wrote:
>>>> It caches up to a point. As I understand it, that is about an hour 
>>>> (depending on size and how busy the cluster is, as well as 
>>>> available memory, etc).
>>> Have you found any documentation of slurmdbd caching?  It's 
>>> well-known that slurmctld caches information while slurmdbd is down, 
>>> see for example page 30 in the talk "Field Notes Mark 2: Random 
>>> Musings From Under A New Hat"[1] by Tim Wickberg, SchedMD:
>>>> For slurmdbd, the critical element in the failure domain is
>>>> MySQL, not slurmdbd. slurmdbd itself is stateless.
>>>> ● slurmctld will cache accounting records (up to a limit) if
>>>> slurmdbd is unavailable. This can be hours+ to days+
>>>> depending on your system without data loss.
>>> The statelessness of slurmdbd makes me think that it can't cache any 
>>> data.
>>> Thanks,
>>> Ole
>>> [1] https://slurm.schedmd.com/publications.html
>>>> On 10/31/2022 9:20 PM, Richard Chang wrote:
>>>>> Hi,
>>>>> Just for my info, I would like to know what happens when SlurmDBD 
>>>>> loses connection to the backend database, for example MariaDB.
>>>>> Does it cache the accounting info and keep it until the DB comes 
>>>>> back up, or does it panic and shut down?
