Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
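For reference, this is roughly what the relevant pieces look like on our side (trimmed and slightly genericised, so treat it as a sketch of our setup rather than the literal files):

   # slurmd unit (relevant excerpt)
   [Service]
   Delegate=yes

   # slurm.conf (relevant excerpt)
   ProctrackType=proctrack/cgroup

   # cgroup.conf (illustrative; on a pure cgroup-v2 host autodetect would pick v2 as well)
   CgroupPlugin=cgroup/v2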
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
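In case it helps someone spot what differs from their own setup, these are the kind of checks one can run on an affected node - nothing exotic, just confirming that delegation is in effect and that the host really is cgroup-v2 only:

   systemctl show -p Delegate slurmd
   stat -fc %T /sys/fs/cgroup        # should print cgroup2fs on a pure v2 host
   systemd-cgls --no-pager /system.slice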
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
On 2/12/26 2:56 pm, Griebel, Christian via slurm-users wrote:
Anyone experienced similar problems and got them solved...?
No, sorry, updating the munge RPM across 5000+ nodes with running jobs went without a hitch for us.
You weren't trying to change the munge key at the same time were you?
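(Quickest way to rule that out is the usual credential round trip from the slurmctld host to a compute node, with <node> standing in for one of your node names:

   munge -n | ssh <node> unmunge

If the key were out of sync, unmunge would fail to decode the credential.)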
All the best, Chris
I think the service name is munge not munged, although the binary is munged.
Or was your 'systemctl restart munged' a typo?
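(Easy enough to confirm on a node, e.g.:

   systemctl list-unit-files 'munge*'

which should list the actual unit name(s) installed.)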
William
On Thu, 12 Feb 2026, 19:58 Griebel, Christian via slurm-users, <slurm-users@lists.schedmd.com> wrote:
Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
... thanks for your first answers -
Or was your 'systemctl restart munged' a typo?
... yes, that was a typo - it wreaked havoc without the "d"...
You weren't trying to change the munge key at the same time were you?
No, that was planned for a later (down) time, though - our /etc/munge/munge.key was untouched during the package update & restart.
That smells like the munge key was changed,
... it wasn't - unless a restart of the munge service causes a new key to be created, which I doubt ;-)
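(For anyone wanting to rule that out on their own systems: comparing a checksum of /etc/munge/munge.key between a compute node and the slurmctld host is enough, e.g. running

   md5sum /etc/munge/munge.key

as root on both ends and comparing the output.)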
I have also asked next door at bugs.schedmd.com, yet without a support contract I have little hope of being helped there.
-- ___________________________ Christian Griebel/HPC
That smells like the munge key was changed, which would explain the behavior you see.
Brian Andrus
On 2/12/2026 11:56 AM, Griebel, Christian via slurm-users wrote:
Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
Dear Christian,
On 2/12/26 20:56, Griebel, Christian via slurm-users wrote:
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
What is your OS? What method did you use for updating the Munge software?
If you use the RPM package installation method, updating the munge* packages will automatically restart the "munge" systemd service without any other user intervention. This worked perfectly for us (700 nodes). The slurmd service on the compute nodes isn't affected by the restarted munge service.
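If in doubt about what the packaging actually does on upgrade, the scriptlets shipped with the package can be inspected, for example:

   rpm -q --scripts munge

which prints the install/uninstall scriptlets, including any service restart the packager added.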
Best regards, Ole