[slurm-users] draining nodes due to failed killing of task?

Bjørn-Helge Mevik b.h.mevik at usit.uio.no
Mon Aug 9 06:15:39 UTC 2021


Adrian Sevcenco <Adrian.Sevcenco at spacescience.ro> writes:

> Having just implemented some triggers i just noticed this:
>
> NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> alien-0-47      1    alien*    draining   48   48:1:1 193324   214030      1 rack-0,4 Kill task failed
> alien-0-56      1    alien*     drained   48   48:1:1 193324   214030      1 rack-0,4 Kill task failed
>
> i was wondering why a node is drained when killing of task fails

I guess the heuristic is that something is wrong with the node, so it
should not run more jobs: for example, processes stuck in disk wait or
similar, which might require a reboot.

> and how can i disable it? (i use cgroups)

I don't know how to disable it, but it can be tuned with:

       UnkillableStepTimeout
              The length of time, in seconds, that Slurm will wait before
              deciding that processes in a job step are unkillable (after
              they have been signaled with SIGKILL) and execute
              UnkillableStepProgram. The default timeout value is 60 seconds.
              If exceeded, the compute node will be drained to prevent
              future jobs from being scheduled on the node.

(Note though, that according to
https://bugs.schedmd.com/show_bug.cgi?id=11103 it should not be set
higher than 127 s.)
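
As a rough illustration, raising the timeout while staying under that
limit could look like the fragment below. The value 120 is only an
example I picked, not a recommendation; choose whatever suits your
workload.

```
# slurm.conf (fragment)
# Wait up to 120 s after SIGKILL before declaring the step unkillable
# and draining the node (the default is 60 s).
UnkillableStepTimeout=120
```

Once the underlying problem on a drained node has been dealt with, the
node can be put back into service with something like
"scontrol update NodeName=alien-0-56 State=RESUME".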

You might also want to look at this setting to find out what is going on
on the machine when Slurm cannot kill the job step:

       UnkillableStepProgram
              If the processes in a job step are determined to be
              unkillable for a period of time specified by the
              UnkillableStepTimeout variable, the program specified by
              UnkillableStepProgram will be executed. By default no
              program is run.

              See section UNKILLABLE STEP PROGRAM SCRIPT for more
              information.
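
To make that more concrete, here is a minimal sketch of what such a
program could look like. It is not taken from the Slurm documentation:
the script, its function names and the log path are all made up for
illustration, and I am assuming that SLURM_JOB_ID is available in the
program's environment (it falls back to "unknown" if not). It simply
records which processes are in uninterruptible sleep (state D in
/proc/<pid>/stat) at the moment Slurm gives up on the step.

```python
#!/usr/bin/env python3
"""Hypothetical UnkillableStepProgram helper: log processes in D state."""
import os
import sys
import time

# Assumed log location; adjust to your site. Falls back to stderr.
LOG_PATH = "/var/log/slurm/unkillable.log"


def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the first '(' and the last ')'.
    """
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx].strip())
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Yield (pid, comm) for every process currently in state D."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return  # no procfs available (e.g. not running on Linux)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or stat was unreadable
        if state == "D":
            yield pid, comm


def main(log_path=LOG_PATH):
    # Assumption: Slurm exports SLURM_JOB_ID to this program.
    job_id = os.environ.get("SLURM_JOB_ID", "unknown")
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
    lines = [f"{stamp} job={job_id} unkillable step reported"]
    for pid, comm in uninterruptible_processes():
        lines.append(f"{stamp} job={job_id} D-state pid={pid} comm={comm}")
    report = "\n".join(lines) + "\n"
    try:
        with open(log_path, "a") as log:
            log.write(report)
    except OSError:
        sys.stderr.write(report)  # log file not writable


if __name__ == "__main__":
    main()
```

You would then point UnkillableStepProgram in slurm.conf at wherever
you install the script (the location is up to you), and check the log
the next time a node is drained with "Kill task failed".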

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
