[slurm-users] SlurmDBD losing connection to the backend MariaDB

Richard Chang rchang.lists at gmail.com
Wed Nov 2 14:14:20 UTC 2022


Hello Brian,

Thank you for the reply and for sharing your design. Could you please 
share the details of your MariaDB server HA setup? (Offline is fine; 
feel free to message me directly.)

I would like to understand it so that I can replicate it here.

Thanks & regards,

Richard.

On 11/2/2022 8:09 AM, Brian Andrus wrote:
> RC,
>
> In that scenario, the backup slurmdbd would take over, but then its 
> database would not necessarily be in sync with the 'main' database 
> (hence the warnings/info about it in the documentation).
>
> For my setup, I have 2 slurmdbd hosts, but they both connect to the 
> same, separate, MariaDB server, which is HA. Now, I can take down the 
> primary slurmdbd system and the other will take over, so I can bring 
> them up/down as needed for updates, etc.
>
> If your two slurmdbd servers use different databases, you would need a 
> way to keep them in sync, regardless of which slurmdbd was processing 
> data. There are many ways to do that, but those designs fall under 
> MariaDB and not Slurm.
>
> Brian Andrus
>
> On 11/1/2022 6:49 PM, Richard Chang wrote:
>> Does it mean it is best to use a single slurmdbd host in my case?
>>
>> My primary slurmctld is the backup slurmdbd host, and my worry is: if 
>> the primary slurmdbd host (which is also the MariaDB server) goes 
>> down, will the backup slurmdbd be able to cache data and wait until 
>> MariaDB comes back up?
>>
>> Thanks,
>>
>> RC
>>
>> On 11/2/2022 2:00 AM, Brian Andrus wrote:
>>> Ole,
>>>
>>> Fair enough, it is actually slurmctld that does the caching. 
>>> Technical typo on my part there.
>>>
>>> Just trying to let the user know that there is only a limited window 
>>> during a database outage before information is lost.
>>>
>>> Brian Andrus
>>>
>>> On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
>>>> Hi Brian,
>>>>
>>>> On 11/1/22 05:28, Brian Andrus wrote:
>>>>> It caches up to a point. As I understand it, that is about an hour 
>>>>> (depending on size and how busy the cluster is, as well as 
>>>>> available memory, etc).
>>>>
>>>> Have you found any documentation of slurmdbd caching?  It's 
>>>> well-known that slurmctld caches information while slurmdbd is 
>>>> down, see for example page 30 in the talk "Field Notes Mark 2: 
>>>> Random Musings From Under A New Hat"[1] by Tim Wickberg, SchedMD:
>>>>
>>>>> For slurmdbd, the critical element in the failure domain is
>>>>> MySQL, not slurmdbd. slurmdbd itself is stateless.
>>>>> ● slurmctld will cache accounting records (up to a limit) if
>>>>> slurmdbd is unavailable. This can be hours+ to days+
>>>>> depending on your system without data loss.
>>>>
>>>> The statelessness of slurmdbd makes me think that it can't cache 
>>>> any data.
>>>>
>>>> Thanks,
>>>> Ole
>>>>
>>>> [1] https://slurm.schedmd.com/publications.html
>>>>
>>>>> On 10/31/2022 9:20 PM, Richard Chang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just for my info, I would like to know what happens when SlurmDBD 
>>>>>> loses its connection to the backend database, for example MariaDB.
>>>>>>
>>>>>> Does it cache the accounting info and keep it until the DB comes 
>>>>>> back up, or does it panic and shut down?
>>>>
>>>
>>
>
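
The thread above settles on slurmctld (not slurmdbd) being the component 
that queues accounting records while the database path is down, up to a 
limit. A minimal sketch of how that backlog could be watched is below. It 
assumes that sdiag, run on the slurmctld host, prints a line of the form 
"DBD Agent queue size: N", and that the limit you compare against is 
whatever MaxDBDMsgs resolves to in your slurm.conf. The function names, 
the 50% warning fraction and the limit value are hypothetical choices for 
illustration; none of them come from this thread.

```python
import re
import subprocess

# Assumption: sdiag reports the slurmctld -> slurmdbd backlog on a line like
# "DBD Agent queue size: 0". The pattern is kept loose about spacing.
_QUEUE_RE = re.compile(r"DBD Agent queue size\s*:\s*(\d+)", re.IGNORECASE)


def parse_dbd_queue_size(sdiag_output):
    """Return the queued-record count found in sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def queue_status(size, limit, warn_fraction=0.5):
    """Classify the backlog against a limit (e.g. your MaxDBDMsgs value)."""
    if size is None:
        return "unknown"   # line not found; sdiag output differs from the assumption
    if size >= limit:
        return "full"      # at the limit, records may start to be dropped
    if size >= limit * warn_fraction:
        return "warning"   # backlog is growing; the database should be restored soon
    return "ok"


def read_sdiag():
    """Run sdiag and return its stdout (requires Slurm client tools on PATH)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout
```

On a host with the Slurm tools installed, something like 
queue_status(parse_dbd_queue_size(read_sdiag()), limit=10000) could be 
run from cron while the database is down; the value 10000 is only a 
placeholder and should be replaced with the limit that applies to your 
cluster.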