Community:
What do you do to ensure database reliability in your SLURM environment? We can have multiple controllers and multiple slurmdbds, but my understanding is that slurmdbd can be configured with only a single MySQL server, so what do you do? Do you make that “single MySQL server” a cluster, such as Percona XtraDB? Do you use MySQL replication, then manually switch slurmdbd to a replication slave if the master goes down? Do you do something else?
Thanks.
Daniel
IIUC the database is not "critical": if it goes down, you lose access to some statistics. But job data gets cached anyway and the db will be updated when it comes back online.
Diego
On 22/01/2024 18:23, Daniel L'Hommedieu wrote:
...database reliability in your SLURM environment...
Hi Diego.
In our setup, the database is critical. We have some wrapper scripts that consult the database for information, and we also set environment variables on login, based on user/partition associations. If the database is down, none of those things work.
I doubt there is appetite in the organization to change the way our setup works, but improving database reliability would be a good solution. Mostly I want protection against hardware failure, and that’s why I’m interested in a cluster solution such as XtraDB.
Thanks.
Daniel
On Jan 23, 2024, at 03:23, Diego Zuccato diego.zuccato@unibo.it wrote:
...the database is not "critical": if it goes down, you lose access to some statistics...
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
Hi,
We are using Percona XtraDB cluster to achieve HA for our Slurm databases. There is a single virtual IP that will be kept on one of the cluster's servers using keepalived.
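For readers unfamiliar with the pattern: keepalived can decide which server holds the virtual IP based on the exit status of a check script (its `vrrp_script` mechanism). The following is only a minimal sketch of such a check, written in Python. It tests nothing more than whether the local MySQL port accepts TCP connections. The function name `mysql_port_open`, the host, and the port are assumptions for illustration and are not taken from Xand's setup; a production check would probably also confirm that the node is synced with the rest of the cluster.

```python
import socket


def mysql_port_open(host: str = "127.0.0.1", port: int = 3306, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper: the host and port defaults are assumptions, not
    values from the thread.
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A small wrapper script would exit with status 0 when `mysql_port_open()` returns `True` and non-zero otherwise, since keepalived treats a non-zero exit status from a check script as a failure.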
Regards,
Xand

From: slurm-users <slurm-users-bounces@lists.schedmd.com> on behalf of Daniel L'Hommedieu <dlhommedieu@gmail.com>
Sent: 22 January 2024 17:23
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [slurm-users] Database cluster
...database reliability in your SLURM environment...
Xand,
Thanks - that’s great to hear. I was thinking of using Anycast to achieve the same thing, but good to know that keepalived is a viable solution as well.
Best, Daniel
On Jan 23, 2024, at 09:29, Xand Meaden xand.meaden@kcl.ac.uk wrote:
...Percona XtraDB cluster to achieve HA for our Slurm databases...
Hi Daniel,
We run a simple Galera MySQL cluster, with HAProxy running on every client to steer requests (round-robin) to one of the DB nodes that passes the health check.
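To make the selection logic concrete, here is a rough Python sketch of "round-robin over the nodes that pass a health check". It is an illustration of the idea only, not HAProxy's implementation and not Andreas's configuration; the function name `healthy_round_robin` and the node names in the usage note are hypothetical.

```python
from itertools import cycle
from typing import Callable, Iterable, Iterator


def healthy_round_robin(
    nodes: Iterable[str], is_healthy: Callable[[str], bool]
) -> Iterator[str]:
    """Yield nodes in round-robin order, skipping any that fail `is_healthy`.

    Raises RuntimeError when a full pass over the ring finds no healthy node.
    """
    node_list = list(nodes)
    if not node_list:
        raise ValueError("no nodes configured")
    ring = cycle(node_list)
    while True:
        # Probe each node at most once per pick before giving up.
        for _ in range(len(node_list)):
            node = next(ring)
            if is_healthy(node):
                yield node
                break
        else:
            raise RuntimeError("no healthy database node available")
```

With three hypothetical nodes `db1`, `db2`, `db3` and a health check that fails for `db2`, successive calls to `next()` on the generator alternate between `db1` and `db3`. The health check is passed in as a callable so that a TCP probe or a cluster-state query can be substituted without touching the rotation logic.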
Best, Andreas
On 23.01.2024 at 15:35, Daniel L'Hommedieu dlhommedieu@gmail.com wrote:
...good to know that keepalived is a viable solution as well...
To protect against hardware failure, and to have a freer hand when upgrading the underlying OS, we use virtualization with "live migration"/HA and run the MariaDB server as a VM.
A VM is easy to back up, restore from a snapshot, clone for tests, etc.
In the past I deployed one site (a customer requirement) using a Galera cluster, but the high-availability solution introduced a new level of configuration complexity which, IMO, did not help overall system availability.
cheers
josef
On 22. 01. 24 18:23, Daniel L'Hommedieu wrote:
...database reliability in your SLURM environment...
We do the same as Josef - we run the database on a single VM (MariaDB) and leave it to VMware, in our case, to ensure its availability.
Tina
On 25/01/2024 11:34, Josef Dvoracek wrote:
...we use virtualization with "live migration"/HA...