[slurm-users] Memory usage not tracked
Chin,David
dwc62 at drexel.edu
Wed Apr 6 19:08:37 UTC 2022
Hi, Xand:
How does adding "ReqMem" to the sacct --format string change the output?
E.g. on my cluster running Slurm 20.02.7 (on RHEL8), our GPU nodes have TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43:
$ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING
JobID State AllocTRES ReqTRES ReqMem ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
2512977.batch RUNNING cpu=48,mem=0,node=1 0n 48
2512977.extern RUNNING billing=516,cpu=144,gres/gpu=12,node=3 0n 144
2512977.0 RUNNING cpu=24,gres/gpu:v100=8,gres/gpu=8,mem=0,node=2 0n 24
2513020 RUNNING billing=516,cpu=144,gres/gpu=12,node=3 billing=5+ 0n 144
I.e., note the "mem=0" and the absence of the mem field on some of those lines. In squeue:
JOBID PART NAME USER STATE TIME TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2512977 gpu 1AB_96DMPCLoose_ ba553 RUNNING 22:29:20 1-00:00:00 3 0 gpu[001,003-004]
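For reference, those billing weights come from a partition definition along these lines in slurm.conf (a sketch - the node list here is just a placeholder):

PartitionName=gpu Nodes=gpu[001-012] TRESBillingWeights="CPU=0,Mem=0,GRES/gpu=43"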
In comparison, a job on our def partition which requests a specific amount of mem:
(sacct)
JobID State AllocTRES ReqTRES ReqMem ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
2514854 RUNNING billing=1,cpu=1,mem=36G,node=1 billing=1+ 36Gn 1
2514854.batch RUNNING cpu=1,mem=36G,node=1 36Gn 1
2514854.extern RUNNING billing=1,cpu=1,mem=36G,node=1 36Gn 1
and the squeue line:
JOBID PART NAME USER STATE TIME TIME_LIMIT NODES MIN_MEMO NODELIST(REASON)
2514854 def ClusterJobStart_ sbradley RUNNING 5:05:27 8:00:00 1 36G node003
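Once a job finishes, something along these lines shows both the requested memory and the peak usage recorded by accounting (assuming a JobAcctGather plugin is enabled; the job ID is just the one from above):

$ sacct -j 2514854 --format=JobID%25,State,ReqMem,AllocTRES%50,MaxRSS,MaxRSSNode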
--
David Chin, PhD (he/him) Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu 215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Xand Meaden <xand.meaden at kcl.ac.uk>
Sent: Wednesday, January 12, 2022 12:23
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Memory usage not tracked
Hi,
We wish to record the memory usage of HPC jobs, but with Slurm 20.11 we
cannot get this to work - the information is simply missing. Our two older
clusters running Slurm 19.05 record memory usage as a TRES, e.g. as
shown below:
# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
14029267 RUNNING billing=32,cpu=32,mem=185600M,n+
14037739 RUNNING billing=64,cpu=64,mem=250G,node+
14037739.ba+ RUNNING cpu=32,mem=125G,node=1
14037739.0 RUNNING cpu=1,mem=4000M,node=1
However, with 20.11 we see no memory usage:
# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
771 RUNNING billing=36,cpu=36,node=1
771.batch RUNNING cpu=36,mem=0,node=1
816 RUNNING billing=128,cpu=128,node=1
823 RUNNING billing=36,cpu=36,node=1
I've also checked within the slurm database's cluster_job_table, and
tres_alloc has no "2=" (memory) value for any job.
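For anyone who wants to repeat that check, it was roughly the following (a sketch - in slurmdbd the table name is prefixed with the actual cluster name, and TRES id 2 is mem):

SELECT id_job, tres_alloc FROM <cluster>_job_table ORDER BY id_job DESC LIMIT 5;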
From my reading of https://slurm.schedmd.com/tres.html it's not possible
to disable memory as a TRES, so I can't figure out what I'm missing
here. The 20.11 cluster is running on Ubuntu 20.04 (vs CentOS 7 for the
others), in case that makes any difference!
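For reference, the accounting TRES and select-plugin settings can be checked with e.g. (standard commands, nothing cluster-specific):

# scontrol show config | grep -E 'AccountingStorageTRES|SelectTypeParameters'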
Thanks in advance,
Xand