<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

We're running Slurm 19.05.3 (packaged by Bright). For the three running jobs, 'showjob' shows a total of 564480 CPU-minutes requested toward GrpTRESRunMins, and the remaining usage that the limit is actually checked against is lower still.</div>
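
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

For reference, a minimal Python sketch of how the 564480 figure can be reproduced. The per-job time limits are an assumption, inferred by adding the TIME and TIME_LEFT columns of the squeue output quoted below (4 days, 4 days, and 10 days); the variable names are illustrative only.</div>

<pre>
```python
# (CPUs, time limit in minutes) for running jobs 747445, 747444, 747435.
# Assumption: each limit is TIME + TIME_LEFT from the quoted squeue output,
# i.e. 4 days, 4 days, and 10 days respectively.
running_jobs = [(14, 4 * 1440), (14, 4 * 1440), (28, 10 * 1440)]

# Total CPU-minutes requested by the running jobs.
requested = sum(cpus * minutes for cpus, minutes in running_jobs)

# "current value" reported by showuserlimits for GrpTRESRunMins.
reported = 1399895

print(requested, reported - requested)
```
</pre>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

The second number printed is the part of the reported value that the running jobs' full requests do not account for.</div>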

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

My copy of your scripts dated from August 21, 2019, so I've just cloned your repository to check for differences. I see none: 'showuserlimits -u USER -A ACCOUNT -s cpu' still returns "Limit = 1440000, current value = 1399895".</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

So I assume something is lingering in the database from jobs that have already completed but are still being counted against the user's running total.</div>

<div id="appendonsend"></div>

<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">

<br>

</div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" color="#000000" style="font-size:11pt"><b>From:</b> Ole Holm Nielsen &lt;Ole.H.Nielsen@fysik.dtu.dk&gt;<br>

<b>Sent:</b> Friday, May 8, 2020 9:27 AM<br>

<b>To:</b> slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>

<b>Cc:</b> Renfro, Michael &lt;Renfro@tntech.edu&gt;<br>

<b>Subject:</b> Re: [slurm-users] scontrol show assoc_mgr showing more resources in use than squeue</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt">

<div class="PlainText">Hi Michael,<br>

<br>

Yes, my Slurm tools use and trust the output of Slurm commands such as<br>

sacct, and any discrepancy would have to come from the Slurm database.<br>

Which version of Slurm are you running on the database server and the<br>

node where you run sacct?<br>

<br>

Did you add up the GrpTRESRunMins values of all the user's running jobs?<br>

They should add up to current value = 1402415.  The "showjob"<br>

command prints #CPUs and time limit in minutes, so you need to multiply<br>

these numbers together.  Example:<br>

<br>

This job requests 160 CPUs and has a time limit of 2-00:00:00<br>

(days-hh:mm:ss) = 2880 min.<br>
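
<br>

As a minimal sketch of that multiplication in Python (the helper name slurm_time_to_minutes is hypothetical, and it assumes the limit is given in the [days-]hh:mm:ss form only):<br>

<pre>
```python
def slurm_time_to_minutes(limit):
    """Convert a Slurm time string such as '2-00:00:00' to minutes."""
    days = 0
    if "-" in limit:
        # Split off the optional leading day count.
        day_part, limit = limit.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in limit.split(":")]
    # Assumption: only the [days-]hh:mm:ss form is handled in this sketch.
    if len(parts) != 3:
        raise ValueError("expected [days-]hh:mm:ss")
    hours, minutes, seconds = parts
    return days * 1440 + hours * 60 + minutes + seconds / 60


cpus = 160
limit_minutes = slurm_time_to_minutes("2-00:00:00")

# CPU-minutes this job counts toward GrpTRESRunMins when it starts.
cpu_minutes = cpus * limit_minutes
print(cpu_minutes)
```
</pre>

For this example the result is 160 x 2880 = 460800 CPU-minutes.<br>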

<br>

Did you download the latest versions of my Slurm tools from Github?  I<br>

make improvements to them from time to time.<br>

<br>

/Ole<br>

<br>

<br>

On 08-05-2020 16:12, Renfro, Michael wrote:<br>

> Thanks, Ole. Your showuserlimits script is actually where I got started<br>

> today, and where I found the sacct command I sent earlier.<br>

><br>

> Your script gives the same output for that user: the only line that's<br>

> not a "Limit = None" is for the user's GrpTRESRunMins value, which is<br>

> at "Limit = 1440000, current value = 1402415".<br>

><br>

> The limit value is correct, but the current value is not (due to the<br>

> incorrect sacct output).<br>

><br>

> I've also gone through sacctmgr show runaway to clean up any runaway<br>

> jobs. I had lots, but they were all from a different user, and had no<br>

> effect on this particular user's values.<br>

><br>

> ------------------------------------------------------------------------<br>

> *From:* slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of<br>

> Ole Holm Nielsen &lt;Ole.H.Nielsen@fysik.dtu.dk&gt;<br>

> *Sent:* Friday, May 8, 2020 8:54 AM<br>

> *To:* slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>

> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more<br>

> resources in use than squeue<br>

><br>

> Hi Michael,<br>

><br>

> Maybe you will find a couple of my Slurm tools useful for displaying<br>

> data from the Slurm database in a more user-friendly format:<br>

><br>

> showjob: Show status of Slurm job(s). Both queue information and<br>

> accounting information are printed.<br>

><br>

> showuserlimits: Print Slurm resource user limits and usage<br>

><br>

> The user's limits are printed in detail by showuserlimits.<br>

><br>

> These tools are available from <a href="https://github.com/OleHolmNielsen/Slurm_tools">

https://github.com/OleHolmNielsen/Slurm_tools</a><br>

><br>

> /Ole<br>

><br>

> On 08-05-2020 15:34, Renfro, Michael wrote:<br>

>> Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minutes) GrpTRESRunMins<br>

>> limit applied to each user for years. It generally works as intended,<br>

>> but I've noticed one user whose recorded usage is highly inflated compared to<br>

>> reality, causing the GrpTRESRunMins limit to be enforced much earlier than<br>

>> necessary:<br>

>><br>

>> squeue output, showing roughly 340 CPU-days in running jobs, and all<br>

>> other jobs blocked:<br>

>><br>

>> # squeue -u USER<br>

>> JOBID  PARTI       NAME     USER ST         TIME CPUS NODES<br>

>> NODELIST(REASON) PRIORITY TRES_P START_TIME           TIME_LEFT<br>

>> 747436 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  10-00:00:00<br>

>> 747437 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  4-04:00:00<br>

>> 747438 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  10-00:00:00<br>

>> 747439 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  4-04:00:00<br>

>> 747440 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  10-00:00:00<br>

>> 747441 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  4-14:00:00<br>

>> 747442 batch        job     USER PD         0:00 28   1<br>

>> (AssocGrpCPURunM 4784     N/A    N/A                  10-00:00:00<br>

>> 747446 batch        job     USER PD         0:00 14   1<br>

>> (AssocGrpCPURunM 4778     N/A    N/A                  4-00:00:00<br>

>> 747447 batch        job     USER PD         0:00 14   1<br>

>> (AssocGrpCPURunM 4778     N/A    N/A                  4-00:00:00<br>

>> 747448 batch        job     USER PD         0:00 14   1<br>

>> (AssocGrpCPURunM 4778     N/A    N/A                  4-00:00:00<br>

>> 747445 batch        job     USER  R      8:39:17 14   1     node002<br>

>>       4778     N/A    2020-05-07T23:02:19  3-15:20:43<br>

>> 747444 batch        job     USER  R     16:03:13 14   1     node003<br>

>>       4515     N/A    2020-05-07T15:38:23  3-07:56:47<br>

>> 747435 batch        job     USER  R   1-10:07:42 28   1     node005<br>

>>       3784     N/A    2020-05-06T21:33:54  8-13:52:18<br>

>><br>

>> scontrol output, showing roughly 980 CPU-days in use on the second line,<br>

>> and thus blocking additional jobs:<br>

>><br>

>> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc<br>

>> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21<br>

>> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00<br>

>> UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)<br>

>> Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)<br>

>> GrpSubmitJobs=N(14) GrpWall=N(616142.94)<br>

>> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>

>> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>

>> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>

>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=<br>

>> MaxTRESMinsPJ= MinPrioThresh=<br>

>> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0<br>

>> ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00<br>

>> UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218<br>

>> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)<br>

>> GrpWall=N(227625.69)<br>

>> GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)<br>

>> GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>

>> GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)<br>

>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=<br>

>> MaxTRESMinsPJ= MinPrioThresh=<br>

>><br>

>> Where can I investigate to find the cause of this difference? Thanks.<br>

</div>

</span></font></div>

</body>

</html>