[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue
Renfro, Michael
Renfro at tntech.edu
Fri May 8 13:34:11 UTC 2020
Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minute) GrpTRESRunMins=cpu limit applied to each user for years. It generally works as intended, but I've noticed one user whose recorded usage is inflated well above what their jobs actually account for, causing the limit to be enforced much earlier than it should be.
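For reference, the limit is applied to each user's association with sacctmgr, roughly along these lines (USER and ACCOUNT are placeholders, and I'm writing the invocation from memory, so treat it as a sketch rather than the exact commands):

# sacctmgr modify user where name=USER account=ACCOUNT set GrpTRESRunMins=cpu=1440000
# sacctmgr show assoc where user=USER format=User,Account,GrpTRESRunMins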
squeue output, showing roughly 340 CPU-days of remaining walltime in running jobs, with all other jobs blocked:
# squeue -u USER
JOBID PARTI NAME USER ST TIME CPUS NODES NODELIST(REASON) PRIORITY TRES_P START_TIME TIME_LEFT
747436 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
747437 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00
747438 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
747439 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00
747440 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
747441 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-14:00:00
747442 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
747446 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
747447 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
747448 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
747445 batch job USER R 8:39:17 14 1 node002 4778 N/A 2020-05-07T23:02:19 3-15:20:43
747444 batch job USER R 16:03:13 14 1 node003 4515 N/A 2020-05-07T15:38:23 3-07:56:47
747435 batch job USER R 1-10:07:42 28 1 node005 3784 N/A 2020-05-06T21:33:54 8-13:52:18
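For what it's worth, the 340 CPU-day figure is just CPUs times remaining walltime (the TIME_LEFT column) summed over the three running jobs, since GrpTRESRunMins, as I understand it, is charged against remaining rather than elapsed time:

28 CPUs x ~8.6 days ≈ 240 CPU-days
14 CPUs x ~3.6 days ≈  51 CPU-days
14 CPUs x ~3.3 days ≈  47 CPU-days
             total  ≈ 338 CPU-days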
scontrol output, showing roughly 980 CPU-days counted against the user's association (the second line below), which is what's blocking additional jobs:
# scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21 SharesRaw/Norm/Level/Factor=1/0.03/35/0.00 UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9) Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10) GrpSubmitJobs=N(14) GrpWall=N(616142.94) GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0 ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00 UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218 DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13) GrpWall=N(227625.69) GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0) GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0) MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN= MaxTRESMinsPJ= MinPrioThresh=
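To spell out the arithmetic on that line: GrpTRESRunMins=cpu=1440000(1407455) means 1407455 CPU-minutes are currently counted against the 1440000-minute limit, and 1407455 / 1440 ≈ 977 CPU-days. From the squeue output above I'd expect something closer to 338 x 1440 ≈ 487000 CPU-minutes.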
Where should I look to find the cause of this discrepancy? Thanks.
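The only other cross-check I can think of is sacctmgr's runaway-jobs listing, though I'm not sure whether runaway jobs would even factor into how GrpTRESRunMins is tracked:

# sacctmgr show runawayjobs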
--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University