[slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Sean Crosby scrosby at unimelb.edu.au
Tue Mar 16 10:03:18 UTC 2021


Hi David,


On Tue, 16 Mar 2021 at 06:34, Chin,David <dwc62 at drexel.edu> wrote:

> Hi, Sean:
>
> Slurm version 20.02.6 (via Bright Cluster Manager)
>
>   ProctrackType=proctrack/cgroup
>   JobAcctGatherType=jobacct_gather/linux
>   JobAcctGatherParams=UsePss,NoShared
>
>
> I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this
> job appeared to have left two slurmstepd zombie processes running at
> 100% CPU each, and changed to:
>
>   ProctrackType=proctrack/cgroup
>   JobAcctGatherType=jobacct_gather/cgroup
>   JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill
>

You definitely want the NoOverMemoryKill option in JobAcctGatherParams. It
stops the Slurm accounting (jobacct_gather) plugin from killing jobs based on
its polled memory samples, and leaves memory enforcement to the cgroup limits
instead.
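
To confirm which mechanism killed a job, you can look at the job's memory
cgroup on the compute node and at the accounting record afterwards. A rough
sketch; the path assumes cgroup v1 with the default /sys/fs/cgroup
mountpoint, and <uid>/<jobid> are placeholders:

  # on the compute node: OOM state of the job's memory cgroup
  # (under_oom, and on newer kernels an oom_kill counter)
  cat /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>/memory.oom_control

  # after the job ends: compare recorded peak usage against the request
  sacct -j <jobid> --format=JobID,State,ExitCode,ReqMem,MaxRSS,MaxVMSize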


>
>
> Have asked the user to re-run the job, but that has not happened, yet.
>
> cgroup.conf:
>
>   CgroupMountpoint="/sys/fs/cgroup"
>   CgroupAutomount=yes
>   TaskAffinity=yes
>   ConstrainCores=yes
>   ConstrainRAMSpace=yes
>   ConstrainSwapSpace=no
>   ConstrainDevices=yes
>   ConstrainKmemSpace=yes
>   AllowedRamSpace=100.00
>   AllowedSwapSpace=0.00
>   MinKmemSpace=200
>   MaxKmemPercent=100.00
>   MemorySwappiness=100
>   MaxRAMPercent=100.00
>   MaxSwapPercent=100.00
>   MinRAMSpace=200
>

This looks good too. Our site does not constrain kmem space, but with these
settings you will at least see why the cgroup killed the job (on the compute
node, the job's cgroup records the memory usage at the time of the kill), so
you can check whether it is kmem related.
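
If you want to see whether kmem is the culprit, the cgroup v1 memory
controller keeps separate kernel-memory counters. Again a sketch, assuming
the same /sys/fs/cgroup layout and <uid>/<jobid> placeholders as above:

  cd /sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid>
  # total charged memory vs. the limit Slurm derived from ReqMem
  cat memory.usage_in_bytes memory.limit_in_bytes
  # kernel-memory share of that usage, and its limit (ConstrainKmemSpace)
  cat memory.kmem.usage_in_bytes memory.kmem.limit_in_bytes
  # failcnt increments every time an allocation hit the limit
  cat memory.failcnt memory.kmem.failcnt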

Sean


>
>
> Cheers,
>     Dave
>
> --
> David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
> dwc62 at drexel.edu                     215.571.4335 (o)
> For URCF support: urcf-support at drexel.edu
> https://proteusmaster.urcf.drexel.edu/urcfwiki
> github:prehensilecode
>
>
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Sean Crosby <scrosby at unimelb.edu.au>
> *Sent:* Monday, March 15, 2021 15:22
> *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even
> though MaxRSS and MaxVMSize are under the ReqMem value
>
>
> What are your Slurm settings - what are the values of
>
> ProctrackType
> JobAcctGatherType
> JobAcctGatherParams
>
> and what are the contents of cgroup.conf? Also, what version of Slurm are
> you using?
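>
> You can pull these straight from the node itself; a sketch (the cgroup.conf
> path is the common default and may differ on your site):
>
>   scontrol show config | grep -Ei 'proctracktype|jobacctgather'
>   grep -v '^#' /etc/slurm/cgroup.conf
>   sinfo -V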
>
> Sean
>
> --
> Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
> Research Computing Services | Business Services
> The University of Melbourne, Victoria 3010 Australia