[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08

Jack Chen scsvip at gmail.com
Fri Jul 2 10:39:34 UTC 2021


OK, thanks for your quick response. I will find a way to upgrade it.

On Fri, Jul 2, 2021 at 2:12 PM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
wrote:

> On 7/2/21 7:34 AM, Jack Chen wrote:
> > Slurm is great to use, I've developed several plugins on it. Now I'm
> > working on an issue in slurm.
> >
> > I'm using Slurm 15.08-11. After I enabled cgroup, some training jobs'
> > tasks are killed after a few hours. This can be reproduced several times.
> > After turning off cgroup, the problem disappears.
> >
> > Linux kernel: 3.10.0-327.36.3.el7.x86_64
> >
> > Slurm version: 15.08-11
>
> For Cgroups support I believe you need to upgrade to a much more recent
> Slurm version!!  Probably Slurm 17.02.5 or later, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#cgroup-configuration
>
> > PS: upgrading the slurm version is almost impossible. I'm familiar with
> > slurm code, so I want to fix it in slurm 15.08
>
> IMHO, you will suffer many problems if you stick with this old 15.08
> release.  It is definitely feasible to upgrade Slurm, although you have to
> take great care with the database upgrade if upgrading from 17.02 or
> older.  Upgrading between recent versions is quite straightforward, but it
> is imperative that you upgrade by at most 2 versions at a time!
>
> I have collected upgrading experience and documentation here:
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> Best regards,
> Ole
>
> --
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
>
>
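
The "at most 2 versions at a time" rule quoted above can be sketched as a small planning helper. This is only an illustration, not an official Slurm tool: the function name plan_upgrade and the RELEASES list are hypothetical, the list is an assumption based on the major releases published up to mid-2021, and "2 versions" is read here as two major releases per step. It says nothing about the database conversion that each step also needs.

```python
# Hypothetical helper: plan an upgrade path that jumps at most
# `max_jump` major releases per step.
# ASSUMPTION: this list reflects Slurm major releases up to mid-2021;
# check the official release notes before relying on it.
RELEASES = ["15.08", "16.05", "17.02", "17.11", "18.08", "19.05", "20.02", "20.11"]


def plan_upgrade(current, target, max_jump=2, releases=RELEASES):
    """Return the intermediate releases to install, in order, ending at target."""
    start = releases.index(current)   # raises ValueError for unknown releases
    end = releases.index(target)
    if end < start:
        raise ValueError("target release is older than the current release")
    path = []
    pos = start
    while pos < end:
        # Advance by at most max_jump releases, without overshooting the target.
        pos = min(pos + max_jump, end)
        path.append(releases[pos])
    return path


if __name__ == "__main__":
    # Starting point taken from the thread (15.08); the target is an example.
    for step in plan_upgrade("15.08", "20.11"):
        print("upgrade to", step)
```

Under these assumptions, going from 15.08 to 20.11 takes four steps, and each one should be treated as a full upgrade, including the database, before moving to the next.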