Also -- scontrol show nodes
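For example, the lines I scan for first (a sketch only; the field names are from scontrol show node output on 23.02, adjust as needed):
scontrol show nodes | egrep -i "nodename|state=|cfgtres|alloctres|reason"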
-----Original Message-----
From: Williams, Jenny Avis
Sent: Thursday, March 14, 2024 6:46 PM
To: Ole Holm Nielsen <Ole.H.Nielsen@fysik.dtu.dk>; slurm-users@lists.schedmd.com
Subject: RE: [slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource
I use an alias slist = `sed 's/ /\n/g' | sort | uniq`. Do not copy/paste lines containing "--" directly; the mail formatting may have changed them from the two plain hyphens intended. The examples below are for Slurm 23.02.7, and these commands assume administrator access.
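A minimal sketch of that alias, assuming bash and GNU sed (it splits the one-line-per-record scontrol/sacctmgr output into one field per line):
alias slist="sed 's/ /\n/g' | sort | uniq"
# usage, e.g. for the job in this thread:
scontrol show job 799 | slist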
This is a generalized set of areas I use to find out why things just are not moving along. Either there is indeed a QOS being applied, just not in the way you expect; or the scheduler is bogged down, the pending reason is not updating, and the job's real Reason is different; or the scheduler is indeed "stuck" on something you just aren't seeing yet.
-- Hunt for all qos applied: on the user, the account, or the partition
Don't qualify the sacctmgr listing for the user; if there are any fields that are non-empty in any of the entries, include those in the format list. For instance, if there are limits or QOS's at the account tier they may or may not come into play; if the user, group, or partition has a QOS applied, it may or may not come into play, depending e.g. on whether "parent" has been set. If there is a GrpTRES in a QOS applied to the partition, that GrpTRES applies to the partition, not to the "group"/account that at least I tend to assume is what I want, or would wish for at times. Grp at the partition level means the jobs in aggregate in the partition. How that applies when users have jobs running in this partition, and possibly others, can get interesting.
So, if the pending reason is correct and in the end there is a QOS somewhere at play, look at all QOS's that are in any way related at the partition, user, and account levels.
scontrol show partition normal | slist | egrep -i "min|max|qos|oversubscribe|allow"
sacctmgr list associations where account=users_acct
sacctmgr list assoc where user=user
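If nothing obvious turns up, widen the format list per the note above; for example, something like this (the field selection is just illustrative, and andrewss is the user from this thread):
sacctmgr list assoc where user=andrewss format=Cluster,Account,User,Partition,Share,GrpTRES,GrpJobs,MaxTRES,MaxJobs,MaxSubmit,QOS -p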
For any and all QOS's that come from this output, do a sacctmgr listing so you see any fields with data:
sacctmgr show qos where name=qosname format=etc.
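For instance, one possible field selection (purely illustrative; keep whichever fields turn out to be populated on your site):
sacctmgr show qos where name=normal format=Name,Priority,Flags,GrpTRES,GrpJobs,MaxTRES,MaxTRESPU,MaxJobsPU,MaxWall -p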
-- resource competition
squeue -p normal -t pd --Format=prioritylong,jobid,state,tres:50 | sort -n -k1,2 (*)
Is there any job in this partition that is pending with reason "Resources"? Are the nodes shared with another partition, and if so, is there a job pending in that partition with reason Resources?
(*) If you have more than one partition with the same nodes in it, list all of those partitions in the -p option as a comma-separated list, not just the one the job is in; see the sketch just below. Any higher-priority job will block lower-priority jobs competing for the same resources if those lower-priority jobs are ineligible for backfill. Do that squeue across all of them to see the jobs competing for the same resources in terms of priority.
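The multi-partition form sketched out, with placeholder partition names (substitute the partitions that actually share nodes on your cluster; the Nodes= line of scontrol show partition tells you which those are):
squeue -p partA,partB -t pd --Format=prioritylong,jobid,partition,state,reason,tres:50 | sort -rn -k1,1
# highest-priority pending jobs first; look for Resources near the top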
scontrol show job jobID | slist | egrep -i "oversubscribe|qos|schednodelist|tres|cmd|workdir|min"
# typically I just do scontrol show job jobID | slist and then scan the whole list. That limit is hiding somewhere...
From that job's output:
QOS
OverSubscribe -- should say YES; anything else, that is the reason (the user has added --exclusive).
SchedNodeList -- a favorite recent sticking point. The next job or next few jobs will take dibs on a node that the scheduler believes will become free, but the node will show up as "idle", so look for other jobs in the partition that have SchedNodeList set. Those are the jobs hanging onto the apparently idle nodes.
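One quick way to spot them (the --Format field name schednodes is what I believe prints SchedNodeList; double-check it against squeue --helpformat on your version):
squeue -p normal -t pd -O jobid:12,prioritylong:12,schednodes:30,reasonlist
# pending jobs with a non-empty schednodes column are the ones holding a claim on the "idle" nodes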
scontrol show partition normal | slist
Give the full listing, not qualified; of particular interest are any fields that say e.g. Max or Min, QOS, OverSubscribe, Priority. Are all partitions the same priority, or do they vary? If there are other partitions that are higher priority, they may be absorbing the scheduler's resources, especially if there are short-running jobs there.
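To compare partitions at a glance, something along these lines (PriorityTier and PriorityJobFactor are the relevant field names in 23.02 scontrol output):
scontrol show partition | egrep -i "partitionname|prioritytier|maxtime"
# the PriorityTier line also shows PriorityJobFactor and OverSubscribe for each partition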
-- There is a "snag"
Look at: sdiag -r ; sleep 120 ; sdiag
Users running something like "watch squeue" or "watch sacct" can tank the responsiveness of the scheduler. Under the heading "Remote Procedure Call statistics by user", any user that has on the order of the same count as root could be causing scheduler slowdowns.
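To jump straight to that per-user section after the two-minute window, something like this (the -A line count is a guess; widen it if you have many users):
sdiag -r ; sleep 120 ; sdiag | grep -A 25 "by user"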
Look at " sacct -S now-2minute -E now -a -s completed,failed -X --format=elapsed" -- if you have large numbers of "short" jobs in any partitions, where YMMV for what short means, the scheduler can be overwhelmed.
I hope this helps.
-----Original Message-----
From: Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com>
Sent: Thursday, March 14, 2024 1:16 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Jobs being denied for GrpCpuLimit despite having enough resource
Hi Simon,
Maybe you could print the user's limits using this tool: https://github.com/OleHolmNielsen/Slurm_tools/tree/master/showuserlimits
Which version of Slurm do you run?
/Ole
On 3/14/24 17:47, Simon Andrews via slurm-users wrote:
Our cluster has developed a strange intermittent behaviour where jobs are being put into a pending state because they aren't passing the AssocGrpCpuLimit, even though the user submitting has enough cpus for the job to run.
For example:
$ squeue -o "%.6i %.9P %.8j %.8u %.2t %.10M %.7m %.7c %.20R"
JOBID PARTITION NAME USER ST TIME MIN_MEM MIN_CPU NODELIST(REASON)
799    normal hostname andrewss PD       0:00      2G       5 (AssocGrpCpuLimit)
..so the job isn't running, and it's the only job in the queue, but:
$ sacctmgr list associations part=normal user=andrewss format=Account,User,Partition,Share,GrpTRES
Account User Partition Share GrpTRES
andrewss andrewss normal 1 cpu=5
That user has a limit of 5 CPUs so the job should run.
The weird thing is that this effect is intermittent. A job can hang and the queue will stall for ages but will then suddenly start working and you can submit several jobs and they all work, until one fails again.
The cluster has active nodes and plenty of resource:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up infinite 2 idle compute-0-[6-7]
interactive up 1-12:00:00 3 idle compute-1-[0-1,3]
The slurmctld log just says:
[2024-03-14T16:21:41.275] _slurm_rpc_submit_batch_job: JobId=799 InitPrio=4294901720 usec=259
Whilst it's in this state I can run other jobs with core requests of up to 4 and they work, but not 5. It's like slurm is adding one CPU to the request and then denying it.
I'm sure I'm missing something fundamental but would appreciate it if someone could point out what it is!
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com