[slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value

Chin,David dwc62 at drexel.edu
Mon Mar 15 19:34:12 UTC 2021


Hi, Sean:

Slurm version 20.02.6 (via Bright Cluster Manager)

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/linux
  JobAcctGatherParams=UsePss,NoShared


I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes, each running at 100% CPU. After reading it, I changed the settings to:

  ProctrackType=proctrack/cgroup
  JobAcctGatherType=jobacct_gather/cgroup
  JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill

I have asked the user to re-run the job, but that has not happened yet.

cgroup.conf:

  CgroupMountpoint="/sys/fs/cgroup"
  CgroupAutomount=yes
  TaskAffinity=yes
  ConstrainCores=yes
  ConstrainRAMSpace=yes
  ConstrainSwapSpace=no
  ConstrainDevices=yes
  ConstrainKmemSpace=yes
  AllowedRamSpace=100.00
  AllowedSwapSpace=0.00
  MinKmemSpace=200
  MaxKmemPercent=100.00
  MemorySwappiness=100
  MaxRAMPercent=100.00
  MaxSwapPercent=100.00
  MinRAMSpace=200
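
For anyone gathering the same information on their own cluster, here is a minimal sketch of how the values above could be collected. The two helper function names are hypothetical (they are not part of Slurm), and the cgroup.conf path in the usage comment is an assumption that varies by installation (Bright Cluster Manager, for example, may keep it elsewhere).

```shell
# Sketch only: both helpers are hypothetical names, not Slurm commands.

# Read `scontrol show config` output on stdin and keep only the three
# settings discussed in this thread.
slurm_acct_settings() {
    grep -E '^[[:space:]]*(ProctrackType|JobAcctGatherType|JobAcctGatherParams)[[:space:]]*='
}

# Print the active (non-blank, non-comment) lines of a cgroup.conf file.
active_conf_lines() {
    grep -Ev '^[[:space:]]*(#|$)' "$1"
}

# On a cluster (the path below is an assumption; adjust to your install):
#   scontrol show config | slurm_acct_settings
#   active_conf_lines /etc/slurm/cgroup.conf
#   sinfo --version
```

The helpers only filter text, so they can be tried on saved output without access to a running slurmctld.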


Cheers,
    Dave

--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu                     215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode


________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sean Crosby <scrosby at unimelb.edu.au>
Sent: Monday, March 15, 2021 15:22
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value



What are your Slurm settings? Specifically, what are the values of

ProctrackType
JobAcctGatherType
JobAcctGatherParams

and what are the contents of cgroup.conf? Also, what version of Slurm are you using?

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


