Hi all,
We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After updating the slurmdbd, our multi cluster setup was broken until everything was updated to 24.05. We had not anticipated this.
SchedMD says that fixing it would be a very complex operation.
Hence this warning to everybody planning to update: make sure to update everything quickly once you've updated the slurmdbd daemon.
Reference: https://support.schedmd.com/show_bug.cgi?id=20931
Ward
Ward Poelmans via slurm-users slurm-users@lists.schedmd.com writes:
> We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After updating the slurmdbd, our multi cluster setup was broken until everything was updated to 24.05. We had not anticipated this.
When you say "everything", do you mean all the slurmctlds, or also all slurmds?
Hi Bjørn-Helge,
On 26/09/2024 09:50, Bjørn-Helge Mevik via slurm-users wrote:
> When you say "everything", do you mean all the slurmctlds, or also all slurmds?
Yes, the issue was gone after *everything* was upgraded: the slurmctld, slurmd and login nodes.
Ward
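For anyone wanting to check that nothing was left behind after upgrading slurmdbd, here is a minimal sketch of a version-mismatch check. It assumes you have already collected a version string per component (for example from each daemon's `-V` output, assumed here to look like `slurm 24.05.3`). The helper names and the inventory below are hypothetical and are not part of Slurm.

```python
import re


def release_of(version_output: str) -> str:
    """Extract the 'major.minor' release, e.g. '24.05' from 'slurm 24.05.3'."""
    match = re.search(r"(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return f"{match.group(1)}.{match.group(2)}"


def lagging_components(versions: dict[str, str], target: str) -> list[str]:
    """Return the sorted names of components that are not on the target release."""
    return sorted(
        name for name, output in versions.items() if release_of(output) != target
    )


# Hypothetical inventory: component name -> collected version string.
observed = {
    "slurmdbd": "slurm 24.05.3",
    "slurmctld@clusterA": "slurm 24.05.3",
    "slurmctld@clusterB": "slurm 23.02.7",
    "slurmd@node001": "slurm 23.02.7",
    "login01": "slurm 24.05.3",
}

# Components still to be upgraded before the multi-cluster setup is consistent.
print(lagging_components(observed, "24.05"))
```

Anything this lists is still on the old release and, per the report above, should be upgraded soon after slurmdbd.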
On Sep 26, 2024, at 15:03, Ward Poelmans via slurm-users slurm-users@lists.schedmd.com wrote:
> On 26/09/2024 09:50, Bjørn-Helge Mevik via slurm-users wrote:
>> When you say "everything", do you mean all the slurmctlds, or also all slurmds?
> Yes, the issue was gone after *everything* was upgraded: the slurmctld, slurmd and login nodes.
Ward, apologies for reopening your ticket and marking it sev 1 (which apparently is possible!), but their response to this is unsatisfactory. I can understand not wanting to change the code if they made such a large mistake and it's hard to reverse. However, fixing the upgrade guide is something that should be done within the hour, not left undone two days later. They could be generating production outages right now at sites that are following their directions, which promise compatibility.
Thank you for saving those of us that read this list from that major headache!
--
#BlackLivesMatter
Ryan Novosielski - novosirj@rutgers.edu
Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
Office of Advanced Research Computing - MSB A555B, Newark
Rutgers, the State University of NJ