[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

Fri Jul 2 05:34:15 UTC 2021

Slurm is great to use, and I have developed several plugins for it. I am
now trying to track down an issue in Slurm.

I'm using Slurm 15.08-11. After I enabled cgroup support, tasks of some
training jobs are killed after running for a few hours. I have reproduced
this several times, and the problem disappears when cgroup is turned off.

Linux kernel: 3.10.0-327.36.3.el7.x86_64

Slurm version: 15.08-11
Example log of a killed job:

srun: error: ip-65: task 42: Killed
srun: Terminating job step 10346.0
slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
srun: error: ip-65: tasks 40,46 Killed
srun: error: ip-65: tasks 45 Killed
srun: error: ip-57: tasks 19-21 Killed

Accounting record from sacct:

$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
       JobID      State ExitCode DerivedExitCode               Start
------------ ---------- -------- --------------- -------------------
10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04
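
For reference, sacct prints ExitCode as <exit status>:<signal>, so the 0:9
above should mean the step was ended by signal 9 rather than by a non-zero
exit status. Here is a minimal Python sketch for decoding that field. The
function name decode_exit_code is made up for illustration and is not part
of Slurm:

```python
import signal
from typing import Optional, Tuple


def decode_exit_code(field: str) -> Tuple[int, Optional[str]]:
    """Split a sacct ExitCode field such as "0:9" into (exit status, signal name)."""
    status_text, _, signal_text = field.partition(":")
    status = int(status_text)
    signum = int(signal_text) if signal_text else 0
    if signum == 0:
        # No signal was recorded for this job or step.
        return status, None
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # Signal number not known on this platform; keep the raw number.
        name = f"signal {signum}"
    return status, name


if __name__ == "__main__":
    print(decode_exit_code("0:9"))
```

On Linux, signal 9 is SIGKILL, which matches the "Killed" messages from srun.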

cgroup.conf:

I only enabled ConstrainCores:

AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
#ConstrainKmemSpace=no #avoid known Kernel issues
#ConstrainRAMSpace=yes
#AllowedRAMSpace=80
#ConstrainSwapSpace=yes
TaskAffinity=no #use task/affinity plugin instead

Changes in slurm.conf to enable the cgroup CPU constraint:

 ProctrackType=proctrack/cgroup
 TaskPlugin=task/cgroup,task/affinity
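
To confirm that the core constraint is really applied inside a job step, a
small diagnostic like the one below could be run with srun. It is only a
sketch and does not explain the SIGKILL by itself. It assumes a Linux node
with Python 3 available, and the names cpu_view and read_cgroup_membership
are my own, not anything from Slurm:

```python
import os
from typing import Dict, List


def cpu_view() -> Dict[str, object]:
    """Compare the CPUs present on the node with the CPUs this process may use."""
    total = os.cpu_count() or 1
    # sched_getaffinity is Linux-only; fall back to all CPUs if it is missing.
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = list(range(total))
    return {"node_cpus": total, "allowed_cpus": allowed}


def read_cgroup_membership(path: str = "/proc/self/cgroup") -> List[str]:
    """Return the cgroup lines for this process, or an empty list if unreadable."""
    try:
        with open(path) as handle:
            return [line.strip() for line in handle if line.strip()]
    except OSError:
        return []


if __name__ == "__main__":
    view = cpu_view()
    print("node CPUs:", view["node_cpus"])
    print("allowed CPUs:", view["allowed_cpus"])
    for line in read_cgroup_membership():
        print(line)
```

If the training code sizes its thread or worker pools from the node's total
CPU count instead of the allowed set, the two numbers printed here will
differ, which may be worth knowing when comparing runs with and without
cgroup.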

Could the tasks be killed by Slurm itself, or by the OS OOM killer?

I checked the system logs on the worker nodes with
grep -i 'killed process' /var/log/messages and
grep -i 'oom' /var/log/messages, and found nothing.
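
The same two searches can be combined into one pass, which is easier to run
on every worker node. This is a rough Python sketch using the same patterns
as the grep commands; find_kill_lines and scan_file are hypothetical helper
names, and the default path is the /var/log/messages file mentioned above:

```python
import re
from typing import Iterable, List

# Same two patterns as the grep commands, matched case-insensitively.
PATTERN = re.compile(r"killed process|oom", re.IGNORECASE)


def find_kill_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that match either pattern, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_file(path: str = "/var/log/messages") -> List[str]:
    """Scan one log file; return an empty list if it cannot be read."""
    try:
        with open(path, errors="replace") as handle:
            return find_kill_lines(handle)
    except OSError:
        return []


if __name__ == "__main__":
    for hit in scan_file():
        print(hit)
```

An empty result only says that nothing matching was logged to that file. It
does not rule out a kill that was never written there.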

Does anyone have a clue about what is killing the tasks and how to fix it?

PS: Upgrading Slurm is almost impossible for us. I'm familiar with the
Slurm code, so I would like to fix this in Slurm 15.08.