On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
Is there any way to reset a node to “state=idle” after errors in the same way?
First you have to investigate whether the job's user has left any processes behind on the compute node. The cause may very well be stale I/O from the job to a network file server.
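For example (just a sketch, run on the compute node itself; "xxx" is the node name and "username" is whichever account ran the failing job):

$ squeue -w xxx -t all -o '%i %u %T'     # any jobs still associated with the node?
$ ps -u username -o pid,stat,etime,cmd   # processes that user has left behind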
It may also happen that the I/O actually completed *after* Slurm drained the node, and that all user processes have since exited. In that case you can simply "resume" node xxx:
$ scontrol update nodename=xxx state=resume
However, if stale user processes remain, your only choice is to reboot the node and tell Slurm to resume node xxx afterwards:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
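As a rough check afterwards (xxx again being the node name), something like the following shows whether the reboot is still pending and which state and reason the node currently has:

$ scontrol show node xxx | grep -i -E 'state|reason'
$ sinfo -n xxx -o '%N %T %E'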
We just now had a "Kill task failed" event on a node, which caused it to drain; Slurm triggers then sent an e-mail alert to the sysadmin.
Logging in to the node, I found a user process left behind after the Slurm job had been killed:
$ ps auxw | sed /root/d
USER       PID %CPU %MEM      VSZ      RSS TTY STAT START    TIME COMMAND
username 29160 97.4  1.3 13770416 10415916 ?   D    Sep17 2926:25 /home/username/...
As you can see, the process state is "D". According to the "ps" manual, D means "uninterruptible sleep (usually IO)".
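A quick way to list all processes stuck in that state on a node (just a sketch that filters the STAT column of "ps") is:

$ ps -eo pid,user,stat,wchan:32,cmd | awk 'NR==1 || $3 ~ /^D/'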
In this case the only possible fix is to reboot the node, thereby forcibly terminating its frozen I/O against the network file server.
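As an aside, the e-mail alert mentioned above can be set up with a Slurm trigger. A minimal sketch, assuming a site-local notification script /usr/local/sbin/notify_drained.sh (not part of Slurm) and typically run as the SlurmUser account, would be:

$ strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/notify_drained.sh

The script is passed the affected node name(s) as its argument and can mail them to the sysadmin.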
IHTH, Ole