<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks, Ole. Your showuserlimits script is actually where I started today, and it is where I found the scontrol command I sent earlier.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Your script gives the same output for that user: the only line that isn't "Limit = None" is the user's GrpTRESRunMins, which shows "Limit = 1440000, current value = 1402415".</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The limit value is correct, but the current value is not, since it comes from the same inflated scontrol output.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I've also gone through sacctmgr show runawayjobs to clean up any runaway jobs. There were a lot, but they all belonged to a different user, and fixing them had no effect on this particular user's values.</div>
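<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
For reference, a minimal Python sketch of the arithmetic behind the two figures in the original message (roughly 340 vs. 980 CPU-days). It assumes GrpTRESRunMins counts each running job's CPUs times its remaining time, which is how those figures appear to have been derived. The helper name time_left_to_minutes is made up for illustration, it only handles the [D-]HH:MM:SS form seen here, and the job values are copied by hand from the squeue output quoted below.</div>
<pre>
```python
# Sketch: sum CPUS x TIME_LEFT over the running jobs in the squeue listing
# and compare with the cpu usage scontrol reports under GrpTRESRunMins.

def time_left_to_minutes(text):
    """Convert a Slurm time string of the form [D-]HH:MM:SS to minutes."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return days * 1440 + hours * 60 + minutes + seconds / 60

# (CPUS, TIME_LEFT) for the three running jobs, copied from the squeue output.
running_jobs = [
    (14, "3-15:20:43"),  # job 747445
    (14, "3-07:56:47"),  # job 747444
    (28, "8-13:52:18"),  # job 747435
]

squeue_cpu_minutes = sum(
    cpus * time_left_to_minutes(left) for cpus, left in running_jobs
)

# cpu usage shown in parentheses after the 1440000 limit in the scontrol output.
assoc_mgr_cpu_minutes = 1407455

excess_cpu_minutes = assoc_mgr_cpu_minutes - squeue_cpu_minutes

print(f"squeue:    {squeue_cpu_minutes / 1440:.0f} CPU-days")
print(f"assoc_mgr: {assoc_mgr_cpu_minutes / 1440:.0f} CPU-days")
print(f"excess:    {excess_cpu_minutes / 1440:.0f} CPU-days")
```
</pre>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
With the values above, this comes to about 338 CPU-days from squeue against about 977 CPU-days from scontrol.</div>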
<div id="appendonsend"></div>
<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">
<br>
</div>
<hr tabindex="-1" style="display:inline-block; width:98%">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Ole Holm Nielsen &lt;Ole.H.Nielsen@fysik.dtu.dk&gt;<br>
<b>Sent:</b> Friday, May 8, 2020 8:54 AM<br>
<b>To:</b> slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue</font><span style="background: var(--white);"> </span>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="PlainText"><br>
Hi Michael,<br>
<br>
Maybe you will find a couple of my Slurm tools useful for displaying<br>
data from the Slurm database in a more user-friendly format:<br>
<br>
showjob: Show status of Slurm job(s). Both queue information and<br>
accounting information are printed.<br>
<br>
showuserlimits: Print Slurm resource user limits and usage<br>
<br>
The user's limits are printed in detail by showuserlimits.<br>
<br>
These tools are available from <a href="https://github.com/OleHolmNielsen/Slurm_tools">
https://github.com/OleHolmNielsen/Slurm_tools</a><br>
<br>
/Ole<br>
<br>
On 08-05-2020 15:34, Renfro, Michael wrote:<br>
> Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minutes) GrpTRESRunMins<br>
> limit applied to each user for years. It generally works as intended,<br>
> but I've noticed one user whose reported usage is highly inflated compared<br>
> to reality, causing the GrpTRESRunMins limit to be enforced much earlier than<br>
> necessary:<br>
><br>
> squeue output, showing roughly 340 CPU-days in running jobs, and all<br>
> other jobs blocked:<br>
><br>
> # squeue -u USER<br>
> JOBID PARTI NAME USER ST TIME CPUS NODES<br>
> NODELIST(REASON) PRIORITY TRES_P START_TIME TIME_LEFT<br>
> 747436 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00<br>
> 747437 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00<br>
> 747438 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00<br>
> 747439 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00<br>
> 747440 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00<br>
> 747441 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 4-14:00:00<br>
> 747442 batch job USER PD 0:00 28 1<br>
> (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00<br>
> 747446 batch job USER PD 0:00 14 1<br>
> (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00<br>
> 747447 batch job USER PD 0:00 14 1<br>
> (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00<br>
> 747448 batch job USER PD 0:00 14 1<br>
> (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00<br>
> 747445 batch job USER R 8:39:17 14 1 node002<br>
> 4778 N/A 2020-05-07T23:02:19 3-15:20:43<br>
> 747444 batch job USER R 16:03:13 14 1 node003<br>
> 4515 N/A 2020-05-07T15:38:23 3-07:56:47<br>
> 747435 batch job USER R 1-10:07:42 28 1 node005<br>
> 3784 N/A 2020-05-06T21:33:54 8-13:52:18<br>
><br>
> scontrol output, showing roughly 980 CPU-days in use on the second line,<br>
> and thus blocking additional jobs:<br>
><br>
> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc<br>
> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21<br>
> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00<br>
> UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)<br>
> Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)<br>
> GrpSubmitJobs=N(14) GrpWall=N(616142.94)<br>
> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>
> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>
> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>
> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=<br>
> MaxTRESMinsPJ= MinPrioThresh=<br>
> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0<br>
> ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00<br>
> UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218<br>
> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)<br>
> GrpWall=N(227625.69)<br>
> GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)<br>
> GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>
> GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>
> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=<br>
> MaxTRESMinsPJ= MinPrioThresh=<br>
><br>
> Where can I investigate to find the cause of this difference? Thanks.<br>
<br>
<br>
</div>
</span></font></div>
</body>
</html>