[slurm-users] scontrol show assoc_mgr showing more resources in use than squeue
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Fri May 8 14:27:10 UTC 2020
Hi Michael,
Yes, my Slurm tools use and trust the output of Slurm commands such as
sacct, and any discrepancy would have to come from the Slurm database.
Which version of Slurm are you running on the database server and the
node where you run sacct?
Did you add up the GrpTRESRunMins values of all the user's running jobs?
They should add up to the current value = 1402415. The "showjob"
command prints the number of CPUs and the time limit in minutes, so you
need to multiply these two numbers together. Example:
This job requests 160 CPUs and has a time limit of 2-00:00:00
(days-hh:mm:ss) = 2880 min, so it contributes 160 x 2880 = 460800
CPU-minutes to the GrpTRESRunMins usage.
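If you want to script that check, here is a minimal sketch (in Python,
not part of my tools) which sums CPUs times the time limit in minutes
over one user's running jobs. It assumes squeue's %C (CPUs) and %l
(time limit) format fields, and it ignores seconds and does not handle
UNLIMITED time limits:

import subprocess, sys

def slurm_minutes(limit):
    # Convert a Slurm time string (days-hh:mm:ss, hh:mm:ss or mm:ss)
    # to whole minutes; seconds are ignored for brevity.
    days, _, hms = limit.rpartition('-')
    parts = [int(p) for p in hms.split(':')]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, _seconds = parts
    return int(days or 0) * 1440 + hours * 60 + minutes

user = sys.argv[1]
# %C = CPUs, %l = time limit of each of the user's running jobs
out = subprocess.run(['squeue', '-h', '-t', 'RUNNING', '-u', user,
                      '-o', '%C %l'],
                     capture_output=True, text=True, check=True).stdout
total = sum(int(cpus) * slurm_minutes(limit)
            for cpus, limit in (line.split() for line in out.splitlines()))
print(f'{user}: {total} CPU-minutes in running jobs')

The total can then be compared with the current GrpTRESRunMins value
reported by showuserlimits.
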
Did you download the latest versions of my Slurm tools from GitHub? I
make improvements to them from time to time.
/Ole
On 08-05-2020 16:12, Renfro, Michael wrote:
> Thanks, Ole. Your showuserlimits script is actually where I got started
> today, and where I found the sacct command I sent earlier.
>
> Your script gives the same output for that user: the only line that's
> not a "Limit = None" is for the user's GrpTRESRunMins value, which is
> at "Limit = 1440000, current value = 1402415".
>
> The limit value is correct, but the current value is not (due to the
> incorrect sacct output).
>
> I've also gone through sacctmgr show runaway to clean up any runaway
> jobs. I had lots, but they were all from a different user, and had no
> effect on this particular user's values.
>
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk>
> *Sent:* Friday, May 8, 2020 8:54 AM
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* Re: [slurm-users] scontrol show assoc_mgr showing more
> resources in use than squeue
>
> Hi Michael,
>
> Maybe you will find a couple of my Slurm tools useful for displaying
> data from the Slurm database in a more user-friendly format:
>
> showjob: Show status of Slurm job(s). Both queue information and
> accounting information are printed.
>
> showuserlimits: Print Slurm resource user limits and usage
>
> The user's limits are printed in detail by showuserlimits.
>
> These tools are available from https://github.com/OleHolmNielsen/Slurm_tools
>
> /Ole
>
> On 08-05-2020 15:34, Renfro, Michael wrote:
>> Hey, folks. I've had a 1000 CPU-day (1440000 CPU-minute) GrpTRESRunMins
>> limit applied to each user for years. It generally works as intended,
>> but I've noticed one user whose recorded usage is highly inflated
>> compared to reality, causing the GrpTRESRunMins limit to be enforced
>> much earlier than necessary:
>>
>> squeue output, showing roughly 340 CPU-days in running jobs, and all
>> other jobs blocked:
>>
>> # squeue -u USER
>> JOBID PARTI NAME USER ST TIME CPUS NODES NODELIST(REASON) PRIORITY TRES_P START_TIME TIME_LEFT
>> 747436 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
>> 747437 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00
>> 747438 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
>> 747439 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-04:00:00
>> 747440 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
>> 747441 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 4-14:00:00
>> 747442 batch job USER PD 0:00 28 1 (AssocGrpCPURunM 4784 N/A N/A 10-00:00:00
>> 747446 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
>> 747447 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
>> 747448 batch job USER PD 0:00 14 1 (AssocGrpCPURunM 4778 N/A N/A 4-00:00:00
>> 747445 batch job USER R 8:39:17 14 1 node002 4778 N/A 2020-05-07T23:02:19 3-15:20:43
>> 747444 batch job USER R 16:03:13 14 1 node003 4515 N/A 2020-05-07T15:38:23 3-07:56:47
>> 747435 batch job USER R 1-10:07:42 28 1 node005 3784 N/A 2020-05-06T21:33:54 8-13:52:18
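>>
>> (The three running jobs above account for 14 x 5240 + 14 x 4796 +
>> 28 x 12352 = 486360 CPU-minutes of remaining run time, i.e. about
>> 338 CPU-days.)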
>>
>> scontrol output, showing roughly 980 CPU-days in use in the second
>> association record (the user's), and thus blocking additional jobs:
>>
>> # scontrol -o show assoc_mgr users=USER account=ACCOUNT flags=assoc
>> ClusterName=its Account=ACCOUNT UserName= Partition= Priority=0 ID=21
>> SharesRaw/Norm/Level/Factor=1/0.03/35/0.00
>> UsageRaw/Norm/Efctv=2733615872.34/0.39/0.71 ParentAccount=PARENT(9)
>> Lft=1197 DefAssoc=No GrpJobs=N(4) GrpJobsAccrue=N(10)
>> GrpSubmitJobs=N(14) GrpWall=N(616142.94)
>> GrpTRES=cpu=N(84),mem=N(168000),energy=N(0),node=N(40),billing=N(420),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>> GrpTRESMins=cpu=N(9239391),mem=N(18478778157),energy=N(0),node=N(616142),billing=N(45546470),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>> GrpTRESRunMins=cpu=N(1890060),mem=N(3780121866),energy=N(0),node=N(113778),billing=N(9450304),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
>> MaxTRESMinsPJ= MinPrioThresh=
>> ClusterName=its Account=ACCOUNT UserName=USER(UID) Partition= Priority=0
>> ID=56 SharesRaw/Norm/Level/Factor=1/0.08/13/0.00
>> UsageRaw/Norm/Efctv=994969457.37/0.14/0.36 ParentAccount= Lft=1218
>> DefAssoc=Yes GrpJobs=N(3) GrpJobsAccrue=N(10) GrpSubmitJobs=N(13)
>> GrpWall=N(227625.69)
>> GrpTRES=cpu=N(56),mem=N(112000),energy=N(0),node=N(35),billing=N(280),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=8(0)
>> GrpTRESMins=cpu=N(3346095),mem=N(6692190572),energy=N(0),node=N(227625),billing=N(16580497),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>> GrpTRESRunMins=cpu=1440000(1407455),mem=N(2814910466),energy=N(0),node=N(88171),billing=N(7037276),fs/disk=N(0),vmem=N(0),pages=N(0),gres/gpu=N(0)
>> MaxJobs= MaxJobsAccrue= MaxSubmitJobs= MaxWallPJ= MaxTRESPJ= MaxTRESPN=
>> MaxTRESMinsPJ= MinPrioThresh=
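>>
>> (Each field in this output is shown as limit(current value), with N
>> meaning no limit set; the user's GrpTRESRunMins entry
>> cpu=1440000(1407455) works out to 1407455 / 1440, roughly 977
>> CPU-days.)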
>>
>> Where can I investigate to find the cause of this difference? Thanks.