I have a cron job that emails me when hosts go into drain mode and tells me the reason (scontrol show node=$host | grep -i reason).
We get drains with the "Kill task failed" reason about 5 times a day, despite having UnkillableStepTimeout=180.
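
A minimal sketch of the kind of check such a cron job could run, not the actual script described above. It assumes the drain reasons are available as `node|reason` lines, for example from `sinfo -R`; the function name `filter_kill_failed` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read "node|reason" lines on stdin and print only
# the nodes whose reason mentions a failed kill (case-insensitive).
filter_kill_failed() {
    awk -F'|' 'tolower($2) ~ /kill task failed/ { print $1 }'
}

# Assumed usage from cron, on a host with working Slurm client tools:
#   sinfo -R -h -o '%n|%E' | filter_kill_failed | sort -u
# The resulting list can then be piped to whatever mailer the site uses.
```

Keeping the filter separate from the `sinfo` call makes it easy to try on canned input before putting it into cron.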
Right now we are still handling them manually by sshing to the node and running a script we wrote called clean_cgroup_jobs, which looks for the unkilled processes using the cgroup info for the job.
If it finds none, it deletes the cgroups for the job and we resume the node. This is the case about 95% of the time.
In the case of a truly unkillable process, it lists the process and then we investigate manually. Often the cause is a hung NFS mount, and we have various ways of dealing with that, which can involve faking the IP of the offline NFS server on another server so that the node's NFS client kernel process finally exits.
In the rare case where we cannot find a way to kill the unkillable process, we arrange to reboot the node.
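
The "look for leftover processes in the job's cgroup" step can be sketched roughly as below. This is not clean_cgroup_jobs itself, only an illustration: the function names are hypothetical, and the cgroup directory for a job depends on the cgroup version and site configuration, so it is passed in as an argument rather than guessed.

```shell
#!/bin/bash
# List every PID recorded under a job's cgroup directory.
# $1: path to the job's cgroup directory (site-specific; an assumption here).
list_cgroup_pids() {
    local cgdir=$1
    # cgroup.procs holds one PID per line; sub-cgroups each have their own file.
    find "$cgdir" -name cgroup.procs -exec cat {} + 2>/dev/null | sort -un
}

# Print PID, state, wait channel and command for each leftover process.
# A state starting with "D" means uninterruptible sleep.
show_survivors() {
    local pid
    for pid in $(list_cgroup_pids "$1"); do
        ps -o pid=,stat=,wchan=,comm= -p "$pid" 2>/dev/null
    done
}
```

If `list_cgroup_pids` prints nothing, the cgroup is empty and the node can be returned to service with `scontrol update nodename=$host state=resume`.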
-- Paul Raines (http://help.nmr.mgh.harvard.edu)
On Tue, 22 Oct 2024 12:59am, Christopher Samuel via slurm-users wrote:
On 10/21/24 4:35 am, laddaoui--- via slurm-users wrote:
It seems like there's an issue with the termination process on these nodes. Any thoughts on what could be causing this?
That usually means processes are wedged in the kernel for some reason, in an uninterruptible sleep state. You can define an "UnkillableStepProgram" to be run on the node when that happens to capture useful state info. For example, it can iterate through the processes in the job's cgroup and dump their `/proc/$PID/stack` somewhere useful, collect the `ps` info for those same processes, and/or do an `echo w > /proc/sysrq-trigger` to make the kernel dump all blocked tasks.
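
A rough sketch of what such a state-capture program might do, under stated assumptions: the function name `dump_cgroup_state` and the output location are hypothetical, and the job's cgroup directory is passed in explicitly because its path varies between sites and cgroup versions. How the job ID reaches the program should be checked against the slurm.conf man page for the installed version.

```shell
#!/bin/bash
# Dump kernel stacks and ps info for every PID under a cgroup directory.
# $1: cgroup directory for the job (path layout is site-specific)
# $2: output directory (hypothetical location, created if missing)
dump_cgroup_state() {
    local cgdir=$1 outdir=$2 pid
    mkdir -p "$outdir" || return 1
    for pid in $(find "$cgdir" -name cgroup.procs -exec cat {} + 2>/dev/null | sort -un); do
        # Reading /proc/PID/stack needs root, and a process may exit mid-loop,
        # so errors are tolerated here rather than aborting the dump.
        cat "/proc/$pid/stack" > "$outdir/$pid.stack" 2>/dev/null
        ps -o pid,ppid,stat,wchan,etime,args -p "$pid" >> "$outdir/ps.txt" 2>/dev/null
    done
    return 0
}

# Optionally ask the kernel to log all blocked (D state) tasks; root only:
#   echo w > /proc/sysrq-trigger
```

Writing one stack file per PID keeps the output easy to attach to a bug report or compare across incidents.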
All the best, Chris -- Chris Samuel : http://secure-web.cisco.com/1nkj9AvGGR14KG_wv9PtKtCMW_eu_C_6GKksFtwzqIHnSnp9... : Berkeley, CA, USA