Dear community,
While rolling out the latest fix/patch for munged, we restarted the updated munged locally on the compute nodes with "systemctl restart munged" - which unexpectedly caused slurmd to die on a lot of those nodes.
Checking the affected nodes, we saw that most user processes/jobs were still running, which was good - yet a subsequent "systemctl restart slurmd" cancelled all of them.
We have cgroups v2 (only, no hybrid), "Delegate=yes" in the slurmd unit and "ProctrackType=proctrack/cgroup" configured.
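For reference, the relevant bits of our configuration look roughly like this (the drop-in path and filename are just an example of how we set Delegate=yes; option names are standard systemd/Slurm):

```ini
# /etc/systemd/system/slurmd.service.d/override.conf (example drop-in)
[Service]
Delegate=yes

# slurm.conf (relevant excerpt)
ProctrackType=proctrack/cgroup
```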
Other sites report that their user jobs survive a slurmd restart without issues, so we are now at a loss figuring out why on earth this happens in our setup.
Has anyone experienced similar problems and managed to solve them?
Thanks in advance -
--
___________________________
Christian Griebel/HPC