On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
Is there any way to reset a node to “state=idle” after errors in the same way?
First you have to investigate whether the job's user has left any processes behind on the compute node. The cause may very well be stale I/O from the job to a network file server.
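For example (just a sketch, run on the compute node itself; "xxx" is the node name and "username" is whichever account ran the failing job):

$ squeue -w xxx -t all -o '%i %u %T'     # any jobs still associated with the node?
$ ps -u username -o pid,stat,etime,cmd   # processes that user has left behind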
It may also happen that the I/O actually completed *after* Slurm drained the node, and that all user processes have since exited. In that case you can simply "resume" node xxx:
$ scontrol update nodename=xxx state=resume
However, if stale user processes remain, your only choice is to reboot the node and tell Slurm to resume node xxx afterwards:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
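As a rough check afterwards (xxx again being the node name), something like the following shows whether the reboot is still pending and which state and reason the node currently has:

$ scontrol show node xxx | grep -i -E 'state|reason'
$ sinfo -n xxx -o '%N %T %E'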
We just now had a "Kill task failed" event on a node, which caused it to drain; Slurm triggers then sent an e-mail alert to the sysadmin.
Logging in to the node, I found a user process left behind after the Slurm job had been killed:
$ ps auxw | sed /root/d
USER       PID %CPU %MEM      VSZ      RSS TTY STAT START    TIME COMMAND
username 29160 97.4  1.3 13770416 10415916 ?   D    Sep17 2926:25 /home/username/...
As you can see, the process state is "D". According to the "ps" manual, D means "uninterruptible sleep (usually IO)".
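A quick way to list all processes stuck in that state on a node (just a sketch that filters the STAT column of "ps") is:

$ ps -eo pid,user,stat,wchan:32,cmd | awk 'NR==1 || $3 ~ /^D/'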
In this case the only possible fix is to reboot the node, thereby forcibly terminating its frozen I/O against the network file server.
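As an aside, the e-mail alert mentioned above can be set up with a Slurm trigger. A minimal sketch, assuming a site-local notification script /usr/local/sbin/notify_drained.sh (not part of Slurm) and typically run as the SlurmUser account, would be:

$ strigger --set --node --drained --flags=PERM --program=/usr/local/sbin/notify_drained.sh

The script is passed the affected node name(s) as its argument and can mail them to the sysadmin.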
IHTH, Ole