On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
> It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?
That usually means the processes are wedged in the kernel for some reason, stuck in an uninterruptible sleep (state `D` in `ps`) where they can't act on signals, SIGKILL included. You can define an "UnkillableStepProgram" in slurm.conf to be run on the node when that happens, to capture useful state info. For example, it can iterate through the processes in the job's cgroup and dump their `/proc/$PID/stack` somewhere useful, grab the `ps` info for those same processes, and/or do an `echo w > /proc/sysrq-trigger` to make the kernel dump all blocked tasks to the kernel log.
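As a rough illustration of the above, here is a minimal sketch of such a program in Python. It is not a tested production script, and it makes several assumptions: that slurmd passes the job ID in the `SLURM_JOB_ID` environment variable, that the job's cgroup path contains a `job_<id>` component (which depends on your cgroup setup), and that `/var/tmp/unkillable` is an acceptable place for the output (a made-up path). It reads `/proc/$PID/status` rather than calling `ps`, and the function names are purely illustrative.

```python
#!/usr/bin/env python3
"""Sketch of an UnkillableStepProgram that dumps state for a job's processes."""
import os
from pathlib import Path

# Hypothetical output location; pick something that suits your site.
OUT_DIR = "/var/tmp/unkillable"


def job_pids(job_id, proc=Path("/proc")):
    """Return sorted PIDs whose cgroup path has a job_<job_id> component.

    Assumption: Slurm names the job's cgroup directory "job_<id>".
    """
    wanted = f"job_{job_id}"
    pids = []
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            text = (entry / "cgroup").read_text()
        except OSError:
            continue  # process went away or is unreadable
        for line in text.splitlines():
            # Each line is "hierarchy-ID:controllers:/cgroup/path".
            path = line.split(":", 2)[-1]
            if wanted in path.split("/"):
                pids.append(int(entry.name))
                break
    return sorted(pids)


def read_or_note(path):
    """Read a /proc file, or return a note saying why it could not be read."""
    try:
        return path.read_text(errors="replace")
    except OSError as exc:
        return f"<unreadable: {exc}>\n"


def dump_job_state(job_id, out_dir=OUT_DIR, proc=Path("/proc")):
    """Write status, wchan and kernel stack of each job process to a report."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    report = out_dir / f"unkillable_job_{job_id}.txt"
    with report.open("w") as fh:
        for pid in job_pids(job_id, proc):
            fh.write(f"=== PID {pid} ===\n")
            for name in ("status", "wchan", "stack"):
                fh.write(f"--- {name} ---\n")
                fh.write(read_or_note(proc / str(pid) / name))
                fh.write("\n")
    return report


def trigger_blocked_task_dump(trigger=Path("/proc/sysrq-trigger")):
    """Equivalent of `echo w > /proc/sysrq-trigger`; returns False on failure."""
    try:
        trigger.write_text("w")
        return True
    except OSError:
        return False


# Only act when a job ID is present in the environment (assumed set by slurmd).
if __name__ == "__main__" and "SLURM_JOB_ID" in os.environ:
    dump_job_state(os.environ["SLURM_JOB_ID"])
    trigger_blocked_task_dump()
```

To try it you would point `UnkillableStepProgram` in slurm.conf at wherever you install the script; `UnkillableStepTimeout` controls how long slurmd waits before running it. Reading `/proc/$PID/stack` needs root, and the sysrq output ends up in the kernel log (`dmesg`), not in the report file.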
All the best, Chris