[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

Jack Chen scsvip at gmail.com
Fri Jul 2 05:34:15 UTC 2021

Slurm is great to use, and I've developed several plugins for it. I'm now
trying to track down an issue in Slurm.

I'm using Slurm 15.08-11. After I enabled cgroup, tasks of some training
jobs get killed after a few hours. I have reproduced this several times.
With cgroup turned off, the problem disappears.

Linux kernel: 3.10.0-327.36.3.el7.x86_64

Slurm version: 15.08-11
Example log of a killed job:

srun: error ip-65: task 42: Killed
srun: Terminating job step 10346.0
slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
srun: error: ip-65: tasks 40,46 Killed
srun: error: ip-65: tasks 45 Killed
srun: error: ip-57: tasks 19-21 Killed
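
To see at a glance which nodes lost which tasks, the "Killed" lines above can be grouped per node. This is only a rough Python sketch: the function names are hypothetical helpers (not part of Slurm), and the regular expression is an assumption fitted to the srun lines pasted in this post, so it may need adjusting for other message formats.

```python
import re
from collections import defaultdict

# Assumed pattern, fitted to the srun lines in this post, e.g.
#   "srun: error: ip-57: tasks 19-21 Killed"
#   "srun: error ip-65: task 42: Killed"
KILLED_LINE = re.compile(r"srun: error:? (\S+): tasks? ([\d,-]+):? Killed")


def expand_task_list(spec):
    """Expand a task list such as '40,46' or '19-21' into a list of ints."""
    tasks = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        # A single id like '45' has no range end, so reuse the start.
        tasks.extend(range(int(first), int(last or first) + 1))
    return tasks


def killed_tasks_by_node(lines):
    """Map node name -> sorted task ids reported as Killed (hypothetical helper)."""
    by_node = defaultdict(list)
    for line in lines:
        match = KILLED_LINE.search(line)
        if match:
            node, spec = match.groups()
            by_node[node].extend(expand_task_list(spec))
    return {node: sorted(tasks) for node, tasks in by_node.items()}
```

Fed with the lines above, this reports kills on ip-65 and ip-57 only, which may help decide which nodes' kernel logs to look at first.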

Job accounting record:

$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
       JobID      State ExitCode DerivedExitCode               Start
------------ ---------- -------- --------------- -------------------
10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04
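
A note on reading the ExitCode column: sacct prints it as the exit status followed by a colon and the signal number that terminated the process, so 0:9 would mean exit status 0 with signal 9, which is SIGKILL on Linux. A minimal Python sketch of that decoding follows; the function name decode_sacct_exit_code is a hypothetical helper, not part of Slurm, and the signal names come from the machine the script runs on.

```python
import signal
from typing import Optional, Tuple


def decode_sacct_exit_code(field: str) -> Tuple[int, Optional[str]]:
    """Split an sacct 'exit:signal' field into (exit status, signal name or None)."""
    exit_text, _, signal_text = field.partition(":")
    exit_status = int(exit_text)
    signal_number = int(signal_text) if signal_text else 0
    if signal_number == 0:
        return exit_status, None  # no terminating signal recorded
    try:
        return exit_status, signal.Signals(signal_number).name
    except ValueError:
        # Signal number not known on this platform; report it numerically.
        return exit_status, "signal %d" % signal_number


print(decode_sacct_exit_code("0:9"))  # (0, 'SIGKILL') on Linux
```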


I only enabled ConstrainCores:

#ConstrainKmemSpace=no #avoid known Kernel issues
TaskAffinity=no #use task/affinity plugin instead

changes in slurm.conf to enable cgroup cpu


Could this be Slurm's or the OS's OOM killer?

I checked the kernel messages on the worker nodes with
grep -i 'killed process' /var/log/messages and
grep -i 'oom' /var/log/messages, and found nothing.
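
The two greps above can be combined into a single pass that is easy to run over several nodes' logs. This is a sketch only: find_oom_lines is a hypothetical helper, and the pattern just mirrors the two case-insensitive searches from this post, so like grep -i 'oom' it will also match unrelated words that happen to contain "oom".

```python
import re

# Same two case-insensitive searches as the grep commands in this post.
OOM_PATTERN = re.compile(r"killed process|oom", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that mention the OOM killer or a killed process."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

For example, find_oom_lines(open("/var/log/messages", errors="replace")) would return an empty list on a node where both greps find nothing.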

So any clues about how to fix this?

PS: Upgrading Slurm is almost impossible for us. I'm familiar with the
Slurm code, so I'd like to fix this in Slurm 15.08.
