[slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
Chin,David
dwc62 at drexel.edu
Mon Mar 15 19:34:12 UTC 2021
Hi, Sean:
Slurm version 20.02.6 (via Bright Cluster Manager)
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared
I just skimmed https://bugs.schedmd.com/show_bug.cgi?id=5549 because this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and I have changed the settings to:
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill
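To make the change between the two sets of settings explicit, here is a minimal Python sketch that parses KEY=VALUE lines and reports what differs. The helper names parse_settings and diff_settings are hypothetical (they are not part of Slurm), and the two text blocks are simply the values quoted in this message.

```python
def parse_settings(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(before, after):
    """Return {key: (old, new)} for every key whose value differs."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Settings in effect when the job ran (copied from this message).
BEFORE = """
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=UsePss,NoShared
"""

# Settings after the change (copied from this message).
AFTER = """
ProctrackType=proctrack/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill
"""

changes = diff_settings(parse_settings(BEFORE), parse_settings(AFTER))
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```

Only JobAcctGatherType and JobAcctGatherParams differ; ProctrackType is unchanged.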
I have asked the user to re-run the job, but that has not happened yet.
cgroup.conf:
CgroupMountpoint="/sys/fs/cgroup"
CgroupAutomount=yes
TaskAffinity=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes
ConstrainKmemSpace=yes
AllowedRamSpace=100.00
AllowedSwapSpace=0.00
MinKmemSpace=200
MaxKmemPercent=100.00
MemorySwappiness=100
MaxRAMPercent=100.00
MaxSwapPercent=100.00
MinRAMSpace=200
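As a rough illustration of what these RAM settings imply, the Python sketch below estimates the cgroup RAM cap for a job. It is based on my reading of the cgroup.conf man page (AllowedRamSpace as a percentage of the allocated memory, MinRAMSpace as a floor in MB, MaxRAMPercent as a ceiling relative to the node's total RAM); the actual task/cgroup plugin logic may differ. The function name effective_ram_limit_mb and the job and node sizes in the example are hypothetical.

```python
def effective_ram_limit_mb(allocated_mb, node_ram_mb,
                           allowed_ram_space=100.0,
                           min_ram_space_mb=200,
                           max_ram_percent=100.0):
    """Estimate the cgroup RAM cap in MB (assumed formula, not Slurm's code)."""
    # Percentage of the job's allocated memory.
    limit = allocated_mb * allowed_ram_space / 100.0
    # Never go below the configured floor.
    limit = max(limit, min_ram_space_mb)
    # Never exceed the configured share of the node's RAM.
    limit = min(limit, node_ram_mb * max_ram_percent / 100.0)
    return limit


# Hypothetical job: 4000 MB requested on a node with 192000 MB of RAM.
typical = effective_ram_limit_mb(4000, 192000)

# Hypothetical tiny job: 100 MB requested, so the MinRAMSpace floor applies.
tiny = effective_ram_limit_mb(100, 192000)

print(typical, tiny)
```

Under these assumptions, with AllowedRamSpace=100.00 the cap equals the requested memory, so the job has no headroom above ReqMem before the cgroup limit is reached.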
Cheers,
Dave
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Sean Crosby <scrosby at unimelb.edu.au>
Sent: Monday, March 15, 2021 15:22
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value
What are your Slurm settings? Specifically, what are the values of
ProctrackType
JobAcctGatherType
JobAcctGatherParams
and what are the contents of cgroup.conf? Also, what version of Slurm are you using?
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia