[slurm-users] Memory usage not tracked

Chin,David dwc62 at drexel.edu
Wed Apr 6 19:08:37 UTC 2022


Hi, Xand:

How does adding "ReqMem" to the sacct format string change the output?

E.g. on my cluster running Slurm 20.02.7 (on RHEL8), our GPU nodes have TRESBillingWeights=CPU=0,Mem=0,GRES/gpu=43:
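
(That weight is set per partition in slurm.conf; roughly like the placeholder line below - partition and node names here are illustrative, not our exact config:)

PartitionName=gpu Nodes=gpu[001-004] TRESBillingWeights="CPU=0,Mem=0,GRES/gpu=43"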

$ sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS|grep RUNNING

                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
            2512977.batch    RUNNING                                cpu=48,mem=0,node=1                    0n       48
           2512977.extern    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3                    0n      144
                2512977.0    RUNNING     cpu=24,gres/gpu:v100=8,gres/gpu=8,mem=0,node=2                    0n       24
                  2513020    RUNNING             billing=516,cpu=144,gres/gpu=12,node=3 billing=5+         0n      144

I.e., note the "mem=0" and the absence of the mem field entirely on some of those lines. In squeue:

       JOBID PART             NAME     USER    STATE       TIME  TIME_LIMIT  NODES MIN_MEMO NODELIST(REASON)
     2512977  gpu 1AB_96DMPCLoose_    ba553  RUNNING   22:29:20  1-00:00:00      3        0 gpu[001,003-004]


For comparison, here is a job on our def partition which requests a specific amount of memory:

(sacct)
                    JobID      State                                          AllocTRES    ReqTRES     ReqMem  ReqCPUS
------------------------- ---------- -------------------------------------------------- ---------- ---------- --------
                  2514854    RUNNING                     billing=1,cpu=1,mem=36G,node=1 billing=1+       36Gn        1
            2514854.batch    RUNNING                               cpu=1,mem=36G,node=1                  36Gn        1
           2514854.extern    RUNNING                     billing=1,cpu=1,mem=36G,node=1                  36Gn        1

and the squeue line:

       JOBID PART             NAME     USER    STATE       TIME  TIME_LIMIT  NODES MIN_MEMO NODELIST(REASON)
     2514854  def ClusterJobStart_ sbradley  RUNNING    5:05:27     8:00:00      1      36G node003
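
On your 20.11 cluster, the same check would be something along these lines (the %-widths are just for readability):

# sacct --format=JobID%25,State,AllocTRES%50,ReqTRES,ReqMem,ReqCPUS | grep RUNNING | head -4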



--
David Chin, PhD (he/him)   Sr. SysAdmin, URCF, Drexel
dwc62 at drexel.edu                     215.571.4335 (o)
For URCF support: urcf-support at drexel.edu
https://proteusmaster.urcf.drexel.edu/urcfwiki
github:prehensilecode
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Xand Meaden <xand.meaden at kcl.ac.uk>
Sent: Wednesday, January 12, 2022 12:23
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: [slurm-users] Memory usage not tracked


Hi,

We wish to record the memory usage of HPC jobs, but with Slurm 20.11 we
cannot get this to work - the information is simply missing. Our two older
clusters with Slurm 19.05 will record memory usage as a TRES, e.g. as
shown below:

# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
14029267        RUNNING billing=32,cpu=32,mem=185600M,n+
14037739        RUNNING billing=64,cpu=64,mem=250G,node+
14037739.ba+    RUNNING           cpu=32,mem=125G,node=1
14037739.0      RUNNING           cpu=1,mem=4000M,node=1

However with 20.11 we see no memory usage:

# sacct --format=JobID,State,AllocTRES%32|grep RUNNING|head -4
771             RUNNING         billing=36,cpu=36,node=1
771.batch       RUNNING              cpu=36,mem=0,node=1
816             RUNNING       billing=128,cpu=128,node=1
823             RUNNING         billing=36,cpu=36,node=1

I've also checked within the slurm database's cluster_job_table, and
tres_alloc has no "2=" (memory) value for any job.
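
(For reference, a check along these lines, with <clustername> standing in for the actual cluster name; in tres_alloc the entries are <TRES id>=<count> pairs, and id 2 is memory:)

mysql> SELECT id_job, tres_alloc FROM <clustername>_job_table ORDER BY id_job DESC LIMIT 10;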

From my reading of https://slurm.schedmd.com/tres.html it's not possible
to disable memory as a TRES, so I can't figure out what I'm missing
here. The 20.11 cluster is running on Ubuntu 20.04 (vs CentOS 7 for the
others), in case that makes any difference!

Thanks in advance,
Xand


