Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
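For reference, this is roughly what the relevant pieces look like on our side (trimmed and slightly genericised, so treat it as a sketch of our setup rather than the literal files):

   # slurmd unit (relevant excerpt)
   [Service]
   Delegate=yes

   # slurm.conf (relevant excerpt)
   ProctrackType=proctrack/cgroup

   # cgroup.conf (illustrative; on a pure cgroup-v2 host autodetect would pick v2 as well)
   CgroupPlugin=cgroup/v2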
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
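In case it helps someone spot what differs from their own setup, these are the kind of checks one can run on an affected node - nothing exotic, just confirming that delegation is in effect and that the host really is cgroup-v2 only:

   systemctl show -p Delegate slurmd
   stat -fc %T /sys/fs/cgroup        # should print cgroup2fs on a pure v2 host
   systemd-cgls --no-pager /system.slice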
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
On 2/12/26 2:56 pm, Griebel, Christian via slurm-users wrote:
Anyone experienced similar problems and got them solved...?
No, sorry, updating the munge RPM across 5000+ nodes with running jobs went without a hitch for us.
You weren't trying to change the munge key at the same time were you?
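(Quickest way to rule that out is the usual credential round trip from the slurmctld host to a compute node, with <node> standing in for one of your node names:

   munge -n | ssh <node> unmunge

If the key were out of sync, unmunge would fail to decode the credential.)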
All the best, Chris
I think the service name is munge not munged, although the binary is munged.
Or was your 'systemctl restart munged' a typo?
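(Easy enough to confirm on a node, e.g.:

   systemctl list-unit-files 'munge*'

which should list the actual unit name(s) installed.)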
William
On Thu, 12 Feb 2026, 19:58 Griebel, Christian via slurm-users, <slurm-users@lists.schedmd.com> wrote:
Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
... thanks for your first answers -
Or was your 'systemctl restart munged' a typo?
... yes, that was a typo - it wreaked havoc without the "d"...
You weren't trying to change the munge key at the same time were you?
No, that was planned for a later (down) time, though - our /etc/munge/munge.key was untouched during the package update & restart.
That smells like the munge key was changed,
... it wasn't - unless a restart of the munge service causes a new key to be created, which I doubt ;-)
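(For anyone wanting to rule that out on their own systems: comparing a checksum of /etc/munge/munge.key between a compute node and the slurmctld host is enough, e.g. running

   md5sum /etc/munge/munge.key

as root on both ends and comparing the output.)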
I have also asked next door at bugs.schedmd.com, yet without a support contract I have little hope of being helped there.
-- ___________________________ Christian Griebel/HPC
That smells like the munge key was changed, which would explain the behavior you see.
Brian Andrus
On 2/12/2026 11:56 AM, Griebel, Christian via slurm-users wrote:
Dear community,
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
Checking the jobs on the affected nodes, we saw a lot of user processes/jobs still running, which was good - yet "systemctl restart slurmd" cancelled all of them, e.g.:
[2026-02-12T17:08:00.325] Cleaning up stray StepId=49695760.extern
[2026-02-12T17:08:00.325] [49695760.extern] Sent signal 9 to StepId=49695760.extern
[2026-02-12T17:08:00.325] slurmd version 25.05.5 started
and all affected user jobs (even though they had survived the death of their parent slurmd) were killed and re-queued...
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
Other sites do not see the same behavior (their user jobs survive a slurmd restart without issues), so now we are at a loss figuring out why the h.... this happens within our setup.
Anyone experienced similar problems and got them solved...?
Thanks in advance -
-- ___________________________ Christian Griebel/HPC
Dear Christian,
On 2/12/26 20:56, Griebel, Christian via slurm-users wrote:
While trying to roll out the latest fix/patch for munged, we restarted the updated munge daemon locally on the compute nodes with "systemctl restart munged", which unexpectedly killed slurmd on a lot of the compute nodes.
What is your OS? What method did you use for updating the Munge software?
If you use the RPM package installation method, updating the munge* packages will automatically restart the "munge" systemd service without any other user intervention. This worked perfectly for us (700 nodes). The slurmd service on the compute nodes isn't affected by the restarted munge service.
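If in doubt about what the packaging actually does on upgrade, the scriptlets shipped with the package can be inspected, for example:

   rpm -q --scripts munge

which prints the install/uninstall scriptlets, including any service restart the packager added.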
Best regards, Ole