Database cluster


Daniel L'Hommedieu

22 Jan 2024

5:23 p.m.

Community:

What do you do to ensure database reliability in your SLURM environment? We can have multiple controllers and multiple slurmdbds, but my understanding is that slurmdbd can be configured with only a single MySQL server, so what do you do? Do you make that “single MySQL server” a cluster, such as Percona XtraDB? Do you use MySQL replication, then manually switch slurmdbd to a replication slave if the master goes down? Do you do something else?

Thanks.

Daniel


Diego Zuccato

23 Jan 2024

8:23 a.m.

IIUC the database is not "critical": if it goes down, you lose access to some statistics. But job data gets cached anyway and the db will be updated when it comes back online.

Diego

-- Diego Zuccato DIFA - Dip. di Fisica e Astronomia Servizi Informatici Alma Mater Studiorum - Università di Bologna V.le Berti-Pichat 6/2 - 40127 Bologna - Italy tel.: +39 051 20 95786

Daniel L'Hommedieu

1:38 p.m.

Hi Diego.

In our setup, the database is critical. We have some wrapper scripts that consult the database for information, and we also set environment variables on login, based on user/partition associations. If the database is down, none of those things work.

I doubt there is appetite in the organization to change the way our setup works, but improving database reliability would be a good solution. Mostly I want to protect against hardware failure, which is why I’m interested in a cluster solution such as XtraDB.

Thanks.

Daniel

Xand Meaden

2:29 p.m.

Hi,

We are using Percona XtraDB Cluster to achieve HA for our Slurm databases. A single virtual IP is kept on one of the cluster's servers by keepalived.
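
As a rough illustration of the kind of health check keepalived could run before keeping the virtual IP on a node, here is a minimal Python sketch. It is not taken from the setup described above. It assumes the Galera status values have already been read from the local node (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`); the function names and the dictionary shape are hypothetical, while the `wsrep_*` names are standard Galera status variables.

```python
from typing import Mapping


def node_is_healthy(status: Mapping[str, str]) -> bool:
    """Return True if a Galera/XtraDB node looks safe to hold the virtual IP.

    `status` is assumed to map wsrep status variable names to their values,
    as fetched from the local database node by some other piece of code.
    """
    return (
        status.get("wsrep_cluster_status") == "Primary"         # node is in the primary component
        and status.get("wsrep_local_state_comment") == "Synced"  # node has caught up with the cluster
        and status.get("wsrep_ready") == "ON"                    # node accepts queries
    )


def check_exit_code(status: Mapping[str, str]) -> int:
    """Map the health result to a shell-style exit code (0 = healthy)."""
    return 0 if node_is_healthy(status) else 1
```

A wrapper script could pass `check_exit_code(...)` to `sys.exit()`, since a keepalived check script reports success with exit status 0.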

Regards, Xand

Daniel L'Hommedieu

2:34 p.m.

Xand,

Thanks - that’s great to hear. I was thinking of using Anycast to achieve the same thing, but good to know that keepalived is a viable solution as well.

Best, Daniel

Henkel, Andreas

24 Jan 2024

7:37 p.m.

Hi Daniel,

We run a simple Galera MySQL cluster, with HAProxy running on every client to steer requests (round-robin) to one of the DB nodes that passes the health check.

Best, Andreas
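
The selection logic described in this reply (round-robin, restricted to nodes that pass a health check) can be sketched in a few lines of Python. This is only an illustration of the idea, not HAProxy's implementation; the node names and the `is_healthy` callback are hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterator, Sequence


def healthy_round_robin(
    nodes: Sequence[str], is_healthy: Callable[[str], bool]
) -> Iterator[str]:
    """Yield nodes in round-robin order, skipping any that fail the check."""
    ring = cycle(nodes)
    while True:
        # Try each node at most once per pick so we never spin forever.
        for _ in range(len(nodes)):
            node = next(ring)
            if is_healthy(node):
                yield node
                break
        else:
            raise RuntimeError("no healthy database node available")


# Example: db2 is failing its health check, so it is skipped.
picker = healthy_round_robin(["db1", "db2", "db3"], lambda n: n != "db2")
picks = [next(picker) for _ in range(4)]
```

Running the health check at pick time keeps the sketch short; a real proxy would more likely probe the nodes in the background and cache the result.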

Josef Dvoracek

25 Jan 2024

11:34 a.m.

To protect against hardware failure, and to have a freer hand when upgrading the underlying OS, we run the MariaDB server as a VM on a virtualization platform with live migration/HA.

A VM is easy to back up, restore from a snapshot, clone for tests, etc.

In the past, I deployed one site using a Galera cluster (a customer requirement), but the high-availability solution introduced a new level of configuration complexity, which IMO did not help overall system availability.

cheers

josef

Tina Friedrich

26 Jan 2024

2:45 p.m.

We do the same as Josef: we run the database on a VM (single VM, MariaDB) and leave it up to VMware (in our case) to ensure its availability.

Tina

slurm-users@lists.schedmd.com

7 comments

6 participants

tags (0)

participants (6)

Daniel L'Hommedieu
Diego Zuccato
Henkel, Andreas
Josef Dvoracek
Tina Friedrich
Xand Meaden