[slurm-users] ML Training task killed(SIGKILL) when cgroup cpu limit enabled in slurm15.08
scsvip at gmail.com
Fri Jul 2 10:39:34 UTC 2021
OK, thanks for your quick response. I will find a way to upgrade it.
On Fri, Jul 2, 2021 at 2:12 PM Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> On 7/2/21 7:34 AM, Jack Chen wrote:
> > Slurm is great to use, and I've developed several plugins for it. Now I'm
> > working on an issue in Slurm.
> > I'm using Slurm 15.08-11. After I enabled cgroup, some training jobs
> > are killed after a few hours. I have reproduced this several times. After
> > turning off cgroup, the problem disappears.
> > Linux kernel: 3.10.0-327.36.3.el7.x86_64
> > Slurm version: 15.08-11
> For Cgroups support I believe you need to upgrade to a much more recent
> Slurm version!! Probably Slurm 17.02.5 or later, see
> > PS: Upgrading the Slurm version is almost impossible. I'm familiar with
> > the Slurm code, so I want to fix it in Slurm 15.08.
> IMHO, you will suffer many problems if you stick with this old 15.08
> release. It is definitely feasible to upgrade Slurm, although you have to
> take great care with the database upgrade if upgrading from 17.02 or
> older. Upgrading between recent versions is quite straightforward, but it
> is imperative that you upgrade by at most 2 versions at a time!
> I have collected upgrading experience and documentation here:
> Best regards,
> Ole Holm Nielsen
> PhD, Senior HPC Officer
> Department of Physics, Technical University of Denmark
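
For anyone planning such an upgrade, the "at most 2 versions at a time" rule quoted above can be turned into a concrete sequence of steps. This is only a minimal sketch: the release list is typed in by hand for illustration and should be checked against the official release notes, `upgrade_path` is a hypothetical helper and not part of Slurm, and "2 versions" is assumed to mean two major releases per hop.

```python
# Slurm major releases in chronological order.
# Assumption: typed in by hand for illustration; verify against the
# official release notes before relying on it.
RELEASES = ["15.08", "16.05", "17.02", "17.11", "18.08", "19.05", "20.02", "20.11"]


def upgrade_path(current, target, max_hop=2, releases=RELEASES):
    """Return the releases to install, in order, to get from current to target.

    Each step moves forward by at most max_hop major releases.
    """
    start = releases.index(current)  # raises ValueError for an unknown release
    end = releases.index(target)
    if end < start:
        raise ValueError("target release is older than the current release")

    path = []
    pos = start
    while pos < end:
        # Jump as far as allowed, but never past the target.
        pos = min(pos + max_hop, end)
        path.append(releases[pos])
    return path


if __name__ == "__main__":
    steps = upgrade_path("15.08", "20.11")
    print(" -> ".join(["15.08"] + steps))
```

Under these assumptions, going from 15.08 to 20.11 takes four separate upgrades, each of which includes its own database conversion.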