[slurm-users] cgroup clean up after "Kill task failed"
Paul Raines
raines at nmr.mgh.harvard.edu
Tue Feb 16 17:23:56 UTC 2021
For the first time I had a node drained with the reason
"Kill task failed"
# scontrol show node=rtx-07 | grep -i reason
Reason=Kill task failed [root@2021-02-15T13:02:57]
From the node's slurmd.log:
[2021-02-15T12:59:27.192] [100418.0] debug2: switch/none:
switch_p_job_postfini: Sending SIGKILL to pgid 2259793
[2021-02-15T12:59:56.909] [100418.extern] Sent SIGKILL signal to
StepId=100418.extern
...
...
[2021-02-15T13:02:47.627] [100418.extern] Sent SIGKILL signal to
StepId=100418.extern
[2021-02-15T13:02:57.000] [100418.extern] error: *** EXTERN STEP FOR 100418
STEPD TERMINATED ON rtx-07 AT 2021-02-15T13:02:56 DUE TO JOB NOT ENDING WITH
SIGNALS ***
[root@rtx-07 ~]# find /sys/fs/cgroup/ -name 'job_100418*'
/sys/fs/cgroup/freezer/slurm/uid_4979000/job_100418
/sys/fs/cgroup/devices/slurm/uid_4979000/job_100418
/sys/fs/cgroup/memory/slurm/uid_4979000/job_100418
/sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418
[root@rtx-07 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418/cgroup.procs
[root@rtx-07 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418/step_extern/cgroup.procs
[root@rtx-07 ~]#
I see no processes actually running in these cgroups, so it is not clear
to me what was not killed. I can find nothing in process group 2259793
either. So I am not sure whether Slurm was confused or there really were
processes that ended after 2021-02-15T13:02:56 but before I logged in to look.
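In case it is useful, the same check can be run against every controller
(both the job level and its step_extern child) with something like the loop
below; the paths are just the ones the find above reported:

[root@rtx-07 ~]# for f in /sys/fs/cgroup/*/slurm/uid_4979000/job_100418/cgroup.procs \
                          /sys/fs/cgroup/*/slurm/uid_4979000/job_100418/*/cgroup.procs ; do
                     echo "== $f" ; cat "$f"
                 done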
Anyway, how do I clean up these "job_100418" cgroup directories under
/sys/fs/cgroup without rebooting?
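My guess is that, since the cgroups appear to be empty, removing the per-job
directories leaf-first with rmdir would do it (rmdir only succeeds on a cgroup
directory that has no tasks and no child cgroups), e.g. something like:

[root@rtx-07 ~]# find /sys/fs/cgroup/*/slurm/uid_4979000/job_100418 -depth -type d -exec rmdir {} \;

but I am not sure whether that is safe to do while slurmd is still running,
or whether slurmd is supposed to release these itself.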
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA