[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

Renfro, Michael Renfro at tntech.edu
Fri May 8 14:12:21 UTC 2020


Thanks, Ole. Your showuserlimits script is actually where I got started today, and where I found the sacct command I sent earlier.

Your script gives the same output for that user: the only line that isn't "Limit = None" is the user's GrpTRESRunMins entry, which shows "Limit = 1440000, current value = 1402415".

The limit value is correct, but the current value is not (due to the incorrect sacct output).
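
As a rough cross-check (my own back-of-the-envelope sketch, not one of your tools): if I understand GrpTRESRunMins correctly, it tracks CPUs times remaining walltime over the user's running jobs, so something along these lines should approximate the expected current value (the format codes and time parsing are mine, adjust as needed):

  squeue -u USER -t RUNNING -h -o "%C %L" | awk '
    {
      cpus = $1; t = $2; days = 0
      # TIME_LEFT looks like [days-]HH:MM:SS (or MM:SS for short jobs)
      if (t ~ /-/) { split(t, d, "-"); days = d[1]; t = d[2] }
      n = split(t, p, ":")
      h = (n == 3) ? p[1] : 0
      m = (n == 3) ? p[2] : p[1]
      s = p[n]
      total += cpus * (days*24*60 + h*60 + m + s/60)
    }
    END { printf "expected GrpTRESRunMins for cpu: %.0f\n", total }'

For this user that comes out to roughly 340 CPU-days' worth (about 490000 CPU-minutes), nowhere near the 1.4 million minutes the association is being charged with.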

I've also gone through sacctmgr show runaway to clean up any runaway jobs. There were lots, but they were all from a different user, and fixing them had no effect on this particular user's values.
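
The other thing I'm comparing on the accounting side, since I suspect a job that accounting still thinks is running could be behind the inflated tally, is sacct's list of this user's running jobs versus squeue's. Roughly like this (the start date is arbitrary, just wide enough to catch anything old):

  sacct -u USER -X --state=RUNNING -S 2020-01-01 \
        --format=JobID,State,Start,Elapsed,AllocCPUS,Timelimit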

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
Sent: Friday, May 8, 2020 8:54 AM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue

Hi Michael,

Maybe you will find a couple of my Slurm tools useful for displaying
data from the Slurm database in a more user-friendly format:

showjob: Show status of Slurm job(s). Both queue information and
accounting information are printed.

showuserlimits: Print Slurm resource user limits and usage

The user's limits are printed in detail by showuserlimits.

These tools are available from https://github.com/OleHolmNielsen/Slurm_tools

/Ole

On 08-05-2020 15:34, Renfro, Michael wrote:
> Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minutes) GrpTRESRunMins
> limit applied to each user for years. It generally works as intended,
> but I've noticed one user whose reported usage is highly inflated
> relative to reality, causing the GrpTRESRunMins limit to be enforced
> much earlier than necessary:
>
> squeue output, showing roughly 340 CPU-days in running jobs, and all
> other jobs blocked:
>
> # squeue -u USER
> JOBID  PARTI NAME USER ST       TIME CPUS NODES NODELIST(REASON) PRIORITY TRES_P START_TIME          TIME_LEFT
> 747436 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
> 747437 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-04:00:00
> 747438 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
> 747439 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-04:00:00
> 747440 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
> 747441 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 4-14:00:00
> 747442 batch job  USER PD       0:00 28   1     (AssocGrpCPURunM 4784     N/A    N/A                 10-00:00:00
> 747446 batch job  USER PD       0:00 14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
> 747447 batch job  USER PD       0:00 14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
> 747448 batch job  USER PD       0:00 14   1     (AssocGrpCPURunM 4778     N/A    N/A                 4-00:00:00
> 747445 batch job  USER  R    8:39:17 14   1     node002          4778     N/A    2020-05-07T23:02:19 3-15:20:43
> 747444 batch job  USER  R   16:03:13 14   1     node003          4515     N/A    2020-05-07T15:38:23 3-07:56:47
> 747435 batch job  USER  R 1-10:07:42 28   1     node005          3784     N/A    2020-05-06T21:33:54 8-13:52:18
>
> scontrol output, showing roughly 980 CPU-days in use (GrpTRESRunMins cpu
> value of 1407455 minutes) in the second record, the user's own
> association, and thus blocking additional jobs:
>
> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21
> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00
> UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)
> Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)
> GrpSubmitJobs=N(14) GrpWall=N(616142.94)
> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
> MaxTRESMinsPJ= MinPrioThresh=
> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0
> ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00
> UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218
> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)
> GrpWall=N(227625.69)
> GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)
> GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
> MaxTRESMinsPJ= MinPrioThresh=
>
> Where can I investigate to find the cause of this difference? Thanks.

