[slurm-users] cgroup clean up after "Kill task failed"
Paul Raines
raines at nmr.mgh.harvard.edu
Tue Feb 16 17:23:56 UTC 2021
For the first time I had a node drained with the reason
"Kill task failed"
# scontrol show node=rtx-07 | grep -i reason
Reason=Kill task failed [root@2021-02-15T13:02:57]
From the node's slurmd.log:
[2021-02-15T12:59:27.192] [100418.0] debug2: switch/none:
switch_p_job_postfini: Sending SIGKILL to pgid 2259793
[2021-02-15T12:59:56.909] [100418.extern] Sent SIGKILL signal to
StepId=100418.extern
...
...
[2021-02-15T13:02:47.627] [100418.extern] Sent SIGKILL signal to
StepId=100418.extern
[2021-02-15T13:02:57.000] [100418.extern] error: *** EXTERN STEP FOR 100418
STEPD TERMINATED ON rtx-07 AT 2021-02-15T13:02:56 DUE TO JOB NOT ENDING WITH
SIGNALS ***
[root@rtx-07 ~]# find /sys/fs/cgroup/ -name 'job_100418*'
/sys/fs/cgroup/freezer/slurm/uid_4979000/job_100418
/sys/fs/cgroup/devices/slurm/uid_4979000/job_100418
/sys/fs/cgroup/memory/slurm/uid_4979000/job_100418
/sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418
[root@rtx-07 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418/cgroup.procs
[root@rtx-07 ~]# cat /sys/fs/cgroup/cpuset/slurm/uid_4979000/job_100418/step_extern/cgroup.procs
[root@rtx-07 ~]#
I see no processes actually running in these cgroups, so it is not clear
to me what was not killed. I can find nothing in process group 2259793
either. So I am not sure whether Slurm was confused or there really were
processes that ended after 2021-02-15T13:02:56 but before I logged in to look.
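In case it is useful, the same check can be run against every controller
(both the job level and its step_extern child) with something like the loop
below; the paths are just the ones the find above reported:

[root@rtx-07 ~]# for f in /sys/fs/cgroup/*/slurm/uid_4979000/job_100418/cgroup.procs \
                          /sys/fs/cgroup/*/slurm/uid_4979000/job_100418/*/cgroup.procs ; do
                     echo "== $f" ; cat "$f"
                 done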
Anyway, how do I clean up these "job_100418" cgroup directories under
/sys/fs/cgroup without rebooting?
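My guess is that, since the cgroups appear to be empty, removing the per-job
directories leaf-first with rmdir would do it (rmdir only succeeds on a cgroup
directory that has no tasks and no child cgroups), e.g. something like:

[root@rtx-07 ~]# find /sys/fs/cgroup/*/slurm/uid_4979000/job_100418 -depth -type d -exec rmdir {} \;

but I am not sure whether that is safe to do while slurmd is still running,
or whether slurmd is supposed to release these itself.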
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA