[slurm-users] SlurmDBD losing connection to the backend MariaDB
Richard Chang
rchang.lists at gmail.com
Wed Nov 2 14:14:20 UTC 2022
Hello Brian,
Thank you for the reply and for sharing your design. Could you please share
the details of your MariaDB server HA setup? (Off-list, by direct message, is fine.)
I would like to understand it so that I can replicate it here.
Thanks & regards,
Richard.
On 11/2/2022 8:09 AM, Brian Andrus wrote:
> RC,
>
> In that scenario, the backup slurmdbd would take over, but then its
> database would not necessarily be in sync with the 'main' database
> (hence the warnings/info about it in the documentation).
>
> For my setup, I have 2 slurmdbd hosts, but they both connect to the
> same, separate, MariaDB server, which is HA. Now, I can take down the
> primary slurmdbd system and the other will take over, so I can bring
> them up/down as needed for updates, etc.
>
> If your two slurmdbd servers use different databases, you would need a
> way to keep them in sync, regardless of which slurmdbd was processing
> data. There are many ways to do that, but those designs fall under
> MariaDB and not Slurm.
>
> Brian Andrus
>
> On 11/1/2022 6:49 PM, Richard Chang wrote:
>> Does it mean it is best to use a single slurmdbd host in my case?
>>
>> My primary slurmctld is the backup slurmdbd host, and my worry is: if
>> the primary slurmdbd host (which is also the MariaDB server) goes
>> down, will the backup slurmdbd be able to cache data and wait until
>> MariaDB catches up?
>>
>> Thanks,
>>
>> RC
>>
>> On 11/2/2022 2:00 AM, Brian Andrus wrote:
>>> Ole,
>>>
>>> Fair enough, it is actually slurmctld that does the caching.
>>> Technical typo on my part there.
>>>
>>> Just trying to let the user know that there is only a limited window
>>> during a database outage before information may be lost.
>>>
>>> Brian Andrus
>>>
>>> On 11/1/2022 1:43 AM, Ole Holm Nielsen wrote:
>>>> Hi Brian,
>>>>
>>>> On 11/1/22 05:28, Brian Andrus wrote:
>>>>> It caches up to a point. As I understand it, that is about an hour
>>>>> (depending on size and how busy the cluster is, as well as
>>>>> available memory, etc).
>>>>
>>>> Have you found any documentation of slurmdbd caching? It's
>>>> well-known that slurmctld caches information while slurmdbd is
>>>> down, see for example page 30 in the talk "Field Notes Mark 2:
>>>> Random Musings From Under A New Hat"[1] by Tim Wickberg, SchedMD:
>>>>
>>>>> For slurmdbd, the critical element in the failure domain is
>>>>> MySQL, not slurmdbd. slurmdbd itself is stateless.
>>>>> ● slurmctld will cache accounting records (up to a limit) if
>>>>> slurmdbd is unavailable. This can be hours+ to days+
>>>>> depending on your system without data loss.
>>>>
>>>> The statelessness of slurmdbd makes me think that it can't cache
>>>> any data.
>>>>
>>>> Thanks,
>>>> Ole
>>>>
>>>> [1] https://slurm.schedmd.com/publications.html
>>>>
>>>>> On 10/31/2022 9:20 PM, Richard Chang wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Just for my info, I would like to know what happens when SlurmDBD
>>>>>> loses connection to the backend database, for example MariaDB.
>>>>>>
>>>>>> Does it cache the accounting info and keep it until the DB comes
>>>>>> back up, or does it panic and shut down?
>>>>
>>>
>>
>
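
The thread concludes that slurmctld, not slurmdbd, caches accounting records
(up to a limit) while the database side is unavailable. One way to watch how
close that cache is to its limit is to read the "DBD Agent queue size" line
from sdiag output. The sketch below is only an illustration: the exact label
text is assumed from recent Slurm releases and should be checked locally, the
function names are hypothetical, and the limit passed in is assumed to be your
site's MaxDBDMsgs value from slurm.conf.

```python
import re
import subprocess
from typing import Optional

# Assumed label, as printed by sdiag on recent Slurm releases; verify locally.
_QUEUE_RE = re.compile(r"DBD Agent queue size:\s*(\d+)")


def parse_dbd_agent_queue_size(sdiag_output: str) -> Optional[int]:
    """Return the DBD agent queue size found in sdiag text, or None if absent."""
    match = _QUEUE_RE.search(sdiag_output)
    return int(match.group(1)) if match else None


def read_sdiag() -> str:
    """Run sdiag and return its standard output (needs the Slurm client tools)."""
    result = subprocess.run(["sdiag"], capture_output=True, text=True, check=True)
    return result.stdout


def queue_is_filling(queue_size: int, max_dbd_msgs: int,
                     warn_fraction: float = 0.5) -> bool:
    """True once the queue holds at least warn_fraction of max_dbd_msgs.

    max_dbd_msgs is assumed to match MaxDBDMsgs in slurm.conf; the 0.5
    default threshold is an arbitrary illustrative choice.
    """
    return queue_size >= warn_fraction * max_dbd_msgs
```

On a host with the Slurm client tools installed, calling
parse_dbd_agent_queue_size(read_sdiag()) during a slurmdbd or MariaDB outage
should show the queue growing. Passing that value to queue_is_filling() gives
a rough signal of how much of the window is left before records may be dropped.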