<div dir="ltr"><p style="margin:0px 0px 15px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">Slurm has been great to use, and I've developed several plugins for it. I'm now debugging an issue in Slurm.<br></p><p style="margin:0px 0px 15px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">I'm using Slurm 15.08-11. After I enabled cgroup support, some tasks of training jobs are killed after a few hours. I have reproduced this several times, and the problem disappears when cgroup is turned off.</p><p style="margin:15px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">Linux kernel: 3.10.0-327.36.3.el7.x86_64</p><p style="margin:15px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">Slurm version: 15.08-11</p><h3 id="gmail-m_8496037081089877081gmail-m_-7086754160346369735gmail-toc_0" style="margin:20px 0px 10px;padding:0px;font-size:18px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif">example of killed job log:</h3><div><pre style="color:rgb(0,0,0);font-size:13px;white-space:pre-wrap;margin-top:15px;margin-bottom:15px;background-color:rgb(248,248,248);border:1px solid rgb(204,204,204);line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px">srun: error ip-65: task 42: Killed<br>srun: Terminating job step 10346.0<br>slurmstepd: *** STEP 10346.0 ON ip-54 CANCELLED AT 2021-06-07T02:35:36 ***<br>srun: error: ip-65: tasks 40,46 Killed<br>srun: error: ip-65: tasks 45 Killed<br>srun: error: ip-57: tasks 19-21 Killed</pre></div><h4 id="gmail-m_8496037081089877081gmail-m_-7086754160346369735gmail-toc_1" style="margin:20px 0px 10px;padding:0px;font-size:16px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif">job logs:</h4><div style="color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px"><pre style="white-space:pre-wrap;margin-top:15px;margin-bottom:15px;background-color:rgb(248,248,248);border:1px solid 
rgb(204,204,204);font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px"><code style="margin:0px;padding:0px;border:none;background-color:transparent;border-radius:3px">$ sacct -j 10310646 --format=JobID,State,ExitCode,DerivedExitCode,start
       JobID      State ExitCode DerivedExitCode               Start
------------ ---------- -------- --------------- -------------------
10310646      COMPLETED      0:9             0:0 2021-06-06T19:34:04</code></pre></div><h3 id="gmail-m_8496037081089877081gmail-m_-7086754160346369735gmail-toc_3" style="margin:20px 0px 10px;padding:0px;font-size:18px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif">cgroup.conf:</h3><p style="margin:15px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">I only enabled ConstrainCores:</p><div style="color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px"><pre style="white-space:pre-wrap;margin-top:15px;margin-bottom:15px;background-color:rgb(248,248,248);border:1px solid rgb(204,204,204);font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px"><code style="margin:0px;padding:0px;border:none;background-color:transparent;border-radius:3px">AllowedDevicesFile="/etc/slurm/cgroup_allowed_devices_file.conf"
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
#ConstrainKmemSpace=no #avoid known Kernel issues
#ConstrainRAMSpace=yes
#AllowedRAMSpace=80
#ConstrainSwapSpace=yes
TaskAffinity=no #use task/affinity plugin instead</code></pre></div><h3 id="gmail-m_8496037081089877081gmail-m_-7086754160346369735gmail-toc_4" style="margin:20px 0px 10px;padding:0px;font-size:18px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif">changes in slurm.conf to enable cgroup cpu</h3><div style="color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px"><pre style="white-space:pre-wrap;margin-top:15px;margin-bottom:15px;background-color:rgb(248,248,248);border:1px solid rgb(204,204,204);font-size:13px;line-height:19px;overflow:auto;padding:6px 10px;border-radius:3px"><code style="margin:0px;padding:0px;border:none;background-color:transparent;border-radius:3px"> ProctrackType=proctrack/cgroup
 TaskPlugin=task/cgroup,task/affinity</code></pre></div><p style="margin:15px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">Could the tasks be killed by Slurm itself, or by the OS OOM killer?</p><p style="margin:15px 0px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">I checked the system logs on the worker nodes with <code style="margin:0px 2px;padding:0px 5px;white-space:nowrap;border:1px solid rgb(234,234,234);background-color:rgb(248,248,248);border-radius:3px">grep -i 'killed process' /var/log/messages</code> and <code style="margin:0px 2px;padding:0px 5px;white-space:nowrap;border:1px solid rgb(234,234,234);background-color:rgb(248,248,248);border-radius:3px">grep -i 'oom' /var/log/messages</code> and found nothing.</p><p style="margin:15px 0px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px">Are there any clues about how to fix this?</p><p style="margin:15px 0px 0px;color:rgb(0,0,0);font-family:Helvetica,arial,sans-serif;font-size:14px"><br>PS: Upgrading the Slurm version is almost impossible. I'm familiar with the Slurm code, so I would like to fix this in Slurm 15.08.</p></div>