<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
Hi, Xand:</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
How does adding "ReqMem" to the sacct format change the output?</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
E.g., on my cluster running Slurm 20.02.7 (on RHEL 8), our GPU partition has TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43:</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<span style="color: rgb(0, 0, 0); font-family: 'Courier New', monospace; font-size: 12pt;">$ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING</span></div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<span style="color: rgb(0, 0, 0); font-family: 'Courier New', monospace; font-size: 12pt;"><br>
</span></div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
            2512977.batch    RUNNING                                cpu=48,mem=0,node=1                    0n       48
           2512977.extern    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3                    0n      144
                2512977.0    RUNNING     cpu=24,gres/gpu:v100=8,gres/gpu=8,mem=0,node=2                    0n       24
                  2513020    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3 billing=5+         0n      144
</pre>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
Note the "mem=0" on some of those lines and the absence of a mem field on others. In squeue:</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
JOBID   PART NAME             USER     STATE   TIME     TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2512977 gpu  1AB_96DMPCLoose_ ba553    RUNNING 22:29:20 1-00:00:00 3     0        gpu[001,003-004]
</pre>
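<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
To pick out every job or step whose AllocTRES shows mem=0 or has no mem entry at all, a filter along these lines may help. This is a minimal sketch that assumes sacct parsable output (-P, "|"-delimited, -n for no header) with the fields JobID|State|AllocTRES in that order; mem_untracked is a hypothetical shell function, not a Slurm command.</div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
```shell
#!/usr/bin/env bash
# Minimal sketch. Assumption: input is "sacct -P -n" output with the fields
# JobID|State|AllocTRES, in that order. mem_untracked is a hypothetical helper.
# Prints the JobID of each line whose AllocTRES has no mem= entry or mem=0.
mem_untracked() {
    awk -F'|' '{
        mem = ""
        # AllocTRES is a comma-separated list such as cpu=48,mem=0,node=1
        split($3, tres, ",")
        for (i in tres)
            if (tres[i] ~ /^mem=/)
                mem = substr(tres[i], 5)   # value after "mem="
        if (mem == "")
            print $1, "no-mem"
        else if (mem == "0")
            print $1, "mem=0"
    }'
}

# Example usage on a live system (-a: all users, -P: "|"-delimited, -n: no header):
#   sacct -a -P -n --format=JobID,State,AllocTRES | grep RUNNING | mem_untracked
```
</pre>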
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
By comparison, here is a job on our def partition that requests a specific amount of memory:</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
(sacct)</div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
                  2514854    RUNNING                     billing=1,cpu=1,mem=36G,node=1 billing=1+       36Gn        1
            2514854.batch    RUNNING                               cpu=1,mem=36G,node=1                  36Gn        1
           2514854.extern    RUNNING                     billing=1,cpu=1,mem=36G,node=1                  36Gn        1
</pre>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
And the corresponding squeue line:</div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
JOBID   PART NAME             USER     STATE   TIME     TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2514854 def  ClusterJobStart_ sbradley RUNNING 5:05:27  8:00:00    1     36G      node003
</pre>
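<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
Separately, the billing=516 on the GPU job's allocation lines is consistent with the TRESBillingWeights quoted at the top: 43 * 12 GPUs = 516, with CPU and Mem weighted at 0. Below is a minimal sketch that recomputes the figure. It assumes Slurm's default of summing weight * count over the allocated TRES (i.e. PriorityFlags=MAX_TRES is not set) and considers only plain integer counts; tres_billing is a hypothetical shell function, not a Slurm command.</div>
<pre style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
```shell
#!/usr/bin/env bash
# Minimal sketch: recompute the "billing" TRES from an AllocTRES string and a
# TRESBillingWeights string. Assumptions: the default sum of weight * count
# (PriorityFlags=MAX_TRES not set), names matched case-insensitively, and only
# plain integer counts used. Values with unit suffixes (e.g. mem=36G) are
# skipped, which only matters when the Mem weight is non-zero.
# tres_billing is a hypothetical helper, not a Slurm command.
tres_billing() {
    awk -v weights="$1" -v alloc="$2" 'BEGIN {
        # Parse weights such as CPU=0,Mem=0,GRES/gpu=43 into w["gres/gpu"] = 43
        split(weights, wp, ",")
        for (i in wp) {
            split(wp[i], kv, "=")
            w[tolower(kv[1])] = kv[2]
        }
        total = 0
        # Add weight * count for each allocated TRES that has a weight
        split(alloc, ap, ",")
        for (i in ap) {
            split(ap[i], kv, "=")
            key = tolower(kv[1])
            if (key in w)
                if (kv[2] ~ /^[0-9]+$/)
                    total += w[key] * kv[2]
        }
        print total
    }'
}

tres_billing "CPU=0,Mem=0,GRES/gpu=43" "cpu=144,gres/gpu=12,node=3"   # prints 516
```
</pre>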
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)" class="elementToProof">
<br>
</div>
<div>
<div style="font-family:'Courier New',monospace; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<div id="Signature">
<div>
<div id="divtagdefaultwrapper" dir="ltr" style="font-size:12pt; color:#000000; font-family:'Courier New',monospace">
<div class="BodyFragment"><font size="2"><span style="font-size:10pt">
<div>--</div>
<div>
<div>David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel</div>
<div>dwc62@drexel.edu 215.571.4335 (o)</div>
<div>For URCF support: urcf-support@drexel.edu</div>
<div>https://proteusmaster.urcf.drexel.edu/urcfwiki</div>
<div>github:prehensilecode</div>
</div>
</span></font></div>
</div>
</div>
</div>
</div>
<div id="appendonsend"></div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Xand Meaden &lt;xand.meaden@kcl.ac.uk&gt;<br>
<b>Sent:</b> Wednesday, January 12, 2022 12:23<br>
<b>To:</b> slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> [slurm-users] Memory usage not tracked</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="PlainText">
Hi,<br>
<br>
We wish to record memory usage of HPC jobs, but with Slurm 20.11 we cannot<br>
get this to work - the information is simply missing. Our two older<br>
clusters with Slurm 19.05 will record memory usage as a TRES, e.g. as<br>
shown below:<br>
<br>
# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4<br>
<pre>
    14029267    RUNNING billing=32,cpu=32,mem=185600M,n+
    14037739    RUNNING billing=64,cpu=64,mem=250G,node+
14037739.ba+    RUNNING           cpu=32,mem=125G,node=1
  14037739.0    RUNNING           cpu=1,mem=4000M,node=1
</pre>
<br>
However with 20.11 we see no memory usage:<br>
<br>
# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4<br>
<pre>
         771    RUNNING         billing=36,cpu=36,node=1
   771.batch    RUNNING              cpu=36,mem=0,node=1
         816    RUNNING       billing=128,cpu=128,node=1
         823    RUNNING         billing=36,cpu=36,node=1
</pre>
<br>
I've also checked within the slurm database's cluster_job_table, and<br>
tres_alloc has no "2=" (memory) value for any job.<br>
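<div>For reference, tres_alloc stores comma-separated id=count pairs. IDs 1 to 8 are Slurm's built-in TRES (1=cpu, 2=mem in MB, 3=energy, 4=node, 5=billing, 6=fs/disk, 7=vmem, 8=pages); GRES and license TRES get site-specific IDs, which sacctmgr show tres lists. Below is a minimal sketch that decodes such a string. decode_tres is a hypothetical shell function, and the sample values are illustrative, not read from either cluster's database.</div>
<pre>
```shell
#!/usr/bin/env bash
# Minimal sketch: turn a raw tres_alloc value (id=count pairs) into names.
# IDs 1-8 are Slurm's built-in TRES; any other ID (GRES, licenses, ...) is
# site-specific, so it is printed as idNNNN (look it up with: sacctmgr show tres).
# decode_tres is a hypothetical helper, not a Slurm command.
decode_tres() {
    awk -v s="$1" 'BEGIN {
        name[1] = "cpu";     name[2] = "mem";     name[3] = "energy"; name[4] = "node"
        name[5] = "billing"; name[6] = "fs/disk"; name[7] = "vmem";   name[8] = "pages"
        split(s, parts, ",")
        out = ""
        # Walk the pairs in their original order ("i in parts" stops after the last one)
        for (i = 1; i in parts; i++) {
            split(parts[i], kv, "=")
            key = (kv[1] in name) ? name[kv[1]] : "id" kv[1]
            out = out (out == "" ? "" : ",") key "=" kv[2]
        }
        print out
    }'
}

decode_tres "1=32,2=185600,4=1,5=32"   # prints cpu=32,mem=185600,node=1,billing=32
decode_tres "1=36,4=1,5=36"            # prints cpu=36,node=1,billing=36 (no "2=", i.e. no mem)
```
</pre>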
<br>
From my reading of <a href="https://slurm.schedmd.com/tres.html" data-auth="NotApplicable">https://slurm.schedmd.com/tres.html</a>, it's not possible<br>
to disable memory as a TRES, so I can't figure out what I'm missing<br>
here. The 20.11 cluster is running on Ubuntu 20.04 (vs CentOS 7 for the<br>
others), in case that makes any difference!<br>
<br>
Thanks in advance,<br>
Xand<br>
<br>
</div>
</span></font></div>
</body>
</html>