[slurm-users] ML Training task killed (SIGKILL) when cgroup cpu limit enabled in Slurm 15.08

Jack Chen scsvip at gmail.com
Fri Jul 2 05:34:15 UTC 2021

Slurm has been great to use; I've developed several plugins for it. Now I'm
debugging an issue in Slurm itself.

I'm using Slurm 15.08-11. After I enabled cgroup, some tasks of ML training
jobs get killed after a few hours. This reproduces consistently; after
turning cgroup off, the problem disappears.

Linux kernel: 3.10.0-327.36.3.el7.x86_64

Slurm version: 15.08-11
Example srun output from a killed job:

srun: error: ip-65: task 42: Killed
srun: Terminating job step 10346.0
slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***
srun: error: ip-65: tasks 40,46 Killed
srun: error: ip-65: task 45: Killed
srun: error: ip-57: tasks 19-21 Killed

sacct output for the job:

$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
       JobID      State ExitCode DerivedExitCode               Start
------------ ---------- -------- --------------- -------------------
10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04
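The ExitCode field's "0:9" is sacct's returncode:signal notation, meaning the
step was terminated by signal 9 (SIGKILL) rather than exiting on its own. A
small shell illustration (variable names are only for demonstration):

```shell
# sacct prints ExitCode as "returncode:signal".
exitcode="0:9"
rc=${exitcode%%:*}    # portion before the colon: the exit status
sig=${exitcode##*:}   # portion after the colon: the terminating signal
echo "exit=$rc signal=$sig"
kill -l "$sig"        # translates the number to a name: KILL
```

So sacct agrees with the srun messages: something delivered SIGKILL to the
tasks.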


In cgroup.conf I only enabled ConstrainCores:

#ConstrainKmemSpace=no   # avoid known kernel issues
TaskAffinity=no          # use the task/affinity plugin instead

plus the changes in slurm.conf needed to enable cgroup CPU constraints.
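For reference, a typical minimal configuration for core confinement in this
Slurm generation looks roughly like the following. This is a sketch built
from the standard TaskPlugin and cgroup.conf parameters, not the reporter's
actual files:

```ini
# slurm.conf (sketch)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf (sketch)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=no
ConstrainKmemSpace=no    # avoid known kernel issues on 3.10 kernels
TaskAffinity=no          # use the task/affinity plugin instead
```

With ConstrainRAMSpace=no, the cgroup plugin should only confine CPU cores,
so a memory-cgroup OOM kill would not normally be expected.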


Could it be Slurm itself, or the OS's OOM killer?

I checked the worker nodes' dmesg and syslog with
grep -i 'killed process' /var/log/messages and
grep -i 'oom' /var/log/messages, and found nothing.

Any clues about how to track this down and fix it?

PS: upgrading the Slurm version is almost impossible for us. I'm familiar
with the Slurm code, so I want to fix this in 15.08 itself.