[slurm-users] ML Training task killed (SIGKILL) when cgroup cpu limit enabled in Slurm 15.08

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Fri Jul 2 06:10:42 UTC 2021


On 7/2/21 7:34 AM, Jack Chen wrote:
> Slurm is great to use, I've developed several plugins on it. Now I'm 
> working on an issue in slurm.
> 
> I'm using Slurm 15.08-11. After I enabled cgroup, tasks of some training
> jobs are killed after a few hours. I have reproduced this several times.
> After turning off cgroup, the problem disappears.
> 
> Linux kernel: 3.10.0-327.36.3.el7.x86_64
> 
> Slurm version: 15.08-11

For cgroups support I believe you need to upgrade to a much more recent
Slurm version, probably Slurm 17.02.5 or later; see
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#cgroup-configuration

> PS: upgrading the slurm version is almost impossible. I'm familiar with 
> slurm code, so I want to fix it in slurm 15.08

IMHO, you will suffer many problems if you stick with this old 15.08
release.  It is definitely feasible to upgrade Slurm, although you have to
take great care with the database upgrade if upgrading from 17.02 or
older.  Upgrading between recent versions is quite straightforward, but it
is imperative that you upgrade by at most 2 major releases at a time (for
example, from 15.08 to 17.02)!
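
As a rough illustration of the two-release rule, here is a minimal Python
sketch that plans the hops from an old release to a target one.  The release
list and the function name upgrade_path are assumptions made for this
example, not anything shipped with Slurm; please check the list against the
SchedMD release archive before relying on it.

```python
# Major Slurm releases in chronological order.
# ASSUMPTION: this list is typed in by hand for illustration; verify it
# against the official release archive.
RELEASES = ["15.08", "16.05", "17.02", "17.11", "18.08", "19.05", "20.02", "20.11"]


def upgrade_path(current, target, max_step=2, releases=RELEASES):
    """Return the releases to install, in order, moving at most
    max_step major releases per hop (hypothetical helper)."""
    start = releases.index(current)   # raises ValueError if unknown
    end = releases.index(target)
    if end < start:
        raise ValueError("target release is older than current release")

    path = []
    pos = start
    while pos < end:
        # Jump as far as allowed without overshooting the target.
        pos = min(pos + max_step, end)
        path.append(releases[pos])
    return path


if __name__ == "__main__":
    print(upgrade_path("15.08", "20.11"))
    # prints ['17.02', '18.08', '20.02', '20.11']
```

Each entry in the result is a release you would install and bring fully up
(including the database conversion) before moving on to the next one.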

I have collected upgrading experience and documentation here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

Best regards,
Ole

-- 
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark


