<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><div dir="ltr"><div dir="ltr"><div>Hi David,</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, 16 Mar 2021 at 06:34, Chin,David <<a href="mailto:dwc62@drexel.edu">dwc62@drexel.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace">Hi, Sean:</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace">Slurm version 20.02.6 (via Bright Cluster Manager)</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
</div>
<span style="font-family:"Courier New",monospace"> ProctrackType=proctrack/cgroup</span>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"> JobAcctGatherType=jobacct_gather/linux</span><br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"> JobAcctGatherParams=UsePss,NoShared</span><br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace">I just skimmed </span><a href="https://bugs.schedmd.com/show_bug.cgi?id=5549" id="gmail-m_9981108240971486LPlnk" target="_blank"><span style="font-family:"Courier New",monospace">https://bugs.schedmd.com/show_bug.cgi?id=5549</span></a><span style="font-family:"Courier New",monospace"> because
this job appeared to have left two slurmstepd zombie processes running at 100% CPU each, and changed the settings to:</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"> ProctrackType=proctrack/cgroup</span>
<div style="margin:0px;font-size:12pt"><span style="font-family:"Courier New",monospace"> JobAcctGatherType=jobacct_gather/cgroup</span><br>
</div>
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"> JobAcctGatherParams=UsePss,NoShared,NoOverMemoryKill</span></div></div></blockquote><div><br></div><div>You definitely want the NoOverMemoryKill option for JobAcctGatherParams. This allows cgroups to kill the job, instead of Slurm accounting.<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)"><br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace">I have asked the user to re-run the job, but that has not happened yet.</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace">cgroup.conf:</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"> CgroupMountpoint="/sys/fs/cgroup"
<div> CgroupAutomount=yes</div>
<div> TaskAffinity=yes</div>
<div> ConstrainCores=yes</div>
<div> ConstrainRAMSpace=yes</div>
<div> ConstrainSwapSpace=no</div>
<div> ConstrainDevices=yes</div>
<div> ConstrainKmemSpace=yes</div>
<div> AllowedRamSpace=100.00</div>
<div> AllowedSwapSpace=0.00</div>
<div> MinKmemSpace=200</div>
<div> MaxKmemPercent=100.00</div>
<div> MemorySwappiness=100</div>
<div> MaxRAMPercent=100.00</div>
<div> MaxSwapPercent=100.00</div>
MinRAMSpace=200<br></span></div></div></blockquote><div><br></div><div>This looks good too. Our site does not restrict kmem space, but at least now you'll see why cgroups kills the job (on the compute node, cgroup will show the memory used at the time of the job kill), so you can see if it is kmem related.</div><div><br></div><div>Sean<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)"><span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace">
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<span style="margin:0px;font-size:12pt;font-family:"Courier New",monospace"><br>
</span></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
Cheers,</div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
Dave</div>
<div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div id="gmail-m_9981108240971486Signature">
<div>
<div></div>
<div></div>
<div></div>
<div id="gmail-m_9981108240971486divtagdefaultwrapper" dir="ltr" style="font-size:12pt;color:rgb(0,0,0);font-family:"Courier New",monospace">
<div><font size="2"><span style="font-size:10pt">
<div></div>
<div style="font-family:"Courier New",monospace;font-size:13.3333px">
</div>
<span id="gmail-m_9981108240971486ms-rterangepaste-start"></span>
<div>--</div>
<div>
<div>David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel</div>
<div><a href="mailto:dwc62@drexel.edu" target="_blank">dwc62@drexel.edu</a> 215.571.4335 (o)</div>
<div>For URCF support: <a href="mailto:urcf-support@drexel.edu" target="_blank">urcf-support@drexel.edu</a></div>
<div><a href="https://proteusmaster.urcf.drexel.edu/urcfwiki" target="_blank">https://proteusmaster.urcf.drexel.edu/urcfwiki</a></div>
<div>github:prehensilecode</div>
</div>
<span id="gmail-m_9981108240971486ms-rterangepaste-end"></span>
<div><br>
</div>
</span></font></div>
</div>
</div>
</div>
</div>
<div id="gmail-m_9981108240971486appendonsend"></div>
<div style="font-family:"Courier New",monospace;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<hr style="display:inline-block;width:98%">
<div id="gmail-m_9981108240971486divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Sean Crosby <<a href="mailto:scrosby@unimelb.edu.au" target="_blank">scrosby@unimelb.edu.au</a>><br>
<b>Sent:</b> Monday, March 15, 2021 15:22<br>
<b>To:</b> Slurm User Community List <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br>
<b>Subject:</b> Re: [slurm-users] [EXT] Job ended with OUT_OF_MEMORY even though MaxRSS and MaxVMSize are under the ReqMem value</font>
<div> </div>
</div>
<div>
<div>
<div dir="ltr">
<div>What are your Slurm settings? Specifically, what are the values of</div>
<div><br>
</div>
ProctrackType<br>
JobAcctGatherType<br>
JobAcctGatherParams<br>
<br>
<div>and what are the contents of cgroup.conf? Also, what version of Slurm are you using?<br>
</div>
<div><br>
</div>
<div>Sean</div>
<div><br>
</div>
<div>
<div>
<div dir="ltr">--<br>
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead<br>
Research Computing Services | Business Services<br>
The University of Melbourne, Victoria 3010 Australia<br>
<br>
</div>
</div>
</div>
</div>
</div>
</div>
<br>
</div>
</blockquote></div></div>
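<div><br></div>
<div>As a rough illustration of the compute-node check mentioned in the reply above, the sketch below reads the cgroup v1 memory counters for a job and guesses whether the kmem limit or the overall memory limit was hit. It is only a sketch under assumptions: it assumes cgroup v1 with the memory controller under the CgroupMountpoint shown above, the directory in the usage note (including the uid and job id) is hypothetical, and the job's cgroup directory may be removed once the job ends, so this may only work while the job still exists. The function names <code>read_counters</code> and <code>likely_cause</code> are made up for this example, and the failcnt comparison is a heuristic, not something Slurm itself reports.</div>
<div><br></div>
<pre>
```python
import sys
from pathlib import Path

# Standard cgroup v1 memory-controller files. Which ones exist depends on
# the kernel and on whether kmem accounting is enabled.
COUNTER_FILES = (
    "memory.limit_in_bytes",
    "memory.max_usage_in_bytes",
    "memory.failcnt",
    "memory.kmem.limit_in_bytes",
    "memory.kmem.max_usage_in_bytes",
    "memory.kmem.failcnt",
)


def read_counters(cgroup_dir):
    """Return {file name: integer value} for each counter file that can be read."""
    counters = {}
    for name in COUNTER_FILES:
        path = Path(cgroup_dir) / name
        try:
            counters[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # Missing, unreadable, or non-numeric file: skip it.
            continue
    return counters


def likely_cause(counters):
    """Heuristic only: report which limit's failure counter is non-zero."""
    if counters.get("memory.kmem.failcnt", 0) > 0:
        return "kmem"
    if counters.get("memory.failcnt", 0) > 0:
        return "memory"
    return "none"


if __name__ == "__main__":
    # Expects exactly one argument: the job's memory cgroup directory.
    if len(sys.argv) == 2:
        found = read_counters(sys.argv[1])
        for name, value in found.items():
            print(name, value)
        print("likely cause:", likely_cause(found))
    else:
        print("usage: pass the job's memory cgroup directory as the only argument")
```
</pre>
<div>Saved under a name of your choice, the script takes the job's memory cgroup directory as its only argument, for example /sys/fs/cgroup/memory/slurm/uid_1000/job_12345 (a hypothetical path; substitute the real uid and job id). A non-zero memory.kmem.failcnt would suggest the kmem limit is involved, which would fit a job ending with OUT_OF_MEMORY while MaxRSS stays under ReqMem.</div>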