On 9/16/25 07:38, Gestió Servidors via slurm-users wrote:
> [root@login-node ~]# sinfo
> PARTITION  TIMELIMIT  AVAIL  STATE    NODELIST  CPU_LOAD  NODES(A/I)  NODES(A/I/O/T)  CPUS  CPUS(A/I/O/T)  REASON
> node.q*    4:00:00    up     drained  clus09    0.00      0/0         0/0/1/1         12    0/0/12/12      Kill task failed
The *Kill task failed* reason is due to the UnkillableStepTimeout [1] configuration:
The length of time, in seconds, that Slurm will wait before deciding that processes in a job step are unkillable (after they have been signaled with SIGKILL) and execute UnkillableStepProgram. The default timeout value is 60 seconds or five times the value of MessageTimeout, whichever is greater. If exceeded, the compute node will be drained to prevent future jobs from being scheduled on the node.
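For reference, these knobs live in slurm.conf. Just an illustrative sketch, not your site's actual settings (the timeout value and the script path below are made up):

  # slurm.conf (example values only)
  UnkillableStepTimeout=180
  # Optional: script slurmd runs when the timeout is exceeded,
  # e.g. to collect "ps" output before the node gets drained
  UnkillableStepProgram=/usr/local/sbin/unkillable_step.sh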
> But it seems there is no error in node... Slurmctld.log in server seems correct, too.
The slurmctld won't have any errors to report here. The error happened on the node, where a job step exceeded the UnkillableStepTimeout, and Slurm has therefore drained it.
> Is there any way to reset node to “state=idle” after errors in the same way?
First you have to investigate whether the job's user has any processes left behind on the compute node. It may very well be stale I/O from the job to a network file server.
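As a rough sketch of what to check (clus09 is the node from your sinfo output, and <username> is a placeholder for the job's owner):

  $ ssh clus09 'ps -u <username> -o pid,stat,wchan:32,cmd'
  # Processes stuck in "D" (uninterruptible sleep) typically indicate hung I/O,
  # often against a network file system.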
It may also happen that the I/O actually completed *after* Slurm drained the node, so that no user processes remain. In this case you may simply "resume" node xxx:
$ scontrol update nodename=xxx state=resume
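Afterwards you can verify that the node is back, for example:

  $ sinfo -n xxx
  $ scontrol show node xxx
  # The node should now report an idle/mixed state with no DRAIN flag and no Reason.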
However, if stale user processes continue to exist, your only choice is to reboot the node and tell Slurm to resume node xxx:
$ scontrol reboot asap nextstate=resume reason="Kill task failed" xxx
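The pending reboot shows up as a flag on the node state, so you can confirm it was queued with, e.g.:

  $ scontrol show node xxx | grep -i state
  # Recent Slurm versions show a REBOOT_REQUESTED (then REBOOT_ISSUED) flag;
  # after the reboot the node should come back with the requested nextstate.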
IHTH, Ole
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_UnkillableStepTimeout