[slurm-users] Longer queuing times for larger jobs
Loris Bennett
loris.bennett at fu-berlin.de
Thu Feb 13 06:45:52 UTC 2020
Loris Bennett <loris.bennett at fu-berlin.de> writes:
> Hello David,
>
> David Baker <D.J.Baker at soton.ac.uk> writes:
>
>> Hello,
>>
>> I've taken a very good look at our cluster, but have as yet not made
>> any significant changes. The one change that I did make was to
>> increase the "jobsizeweight". That's now our dominant parameter and it
>> does ensure that our largest jobs (> 20 nodes) are making it to the
>> top of the sprio listing which is what we want to see.
>>
>> These large jobs aren't making any progress despite the priority
>> lift. I additionally decreased the nice value of the job that sparked
>> this discussion. That is (looking at sprio) there is a 32 node job
>> with a very high priority...
>>
>>  JOBID PARTITION    USER PRIORITY    AGE FAIRSHARE JOBSIZE PARTITION QOS    NICE
>> 280919     batch mep1c10  1275481 400000     59827  415655         0   0 -400000
>>
>> That job has been sitting in the queue for well over a week and it is
>> disconcerting that we never see nodes becoming idle in order to
>> service these large jobs. Nodes do become idle and then get scooped by
>> jobs started by backfill. Looking at the slurmctld logs I see that the
>> vast majority of jobs are being started via backfill -- including, for
>> example, a 24 node job. I see very few jobs allocated by the
>> scheduler. That is, messages like "sched: Allocate JobId=6915" are few
>> and far between and I never see any of the large jobs being allocated
>> in the batch queue.
>>
>> Surely this is not correct. Does anyone have any advice on
>> what to check, please?
>
> Have you looked at what 'sprio' says? I usually want to see the list
> sorted by priority and so call it like this:
>
> sprio -l -S "%Y"
This should be

  sprio -l -S "Y"
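
As a rough cross-check of the sprio line quoted above, the individual
factors can be added up and compared with the reported priority. The
following is only a sketch: the field names and helper functions are
made up for illustration, and it assumes that the columns shown are
the only non-zero factors and that the nice value is subtracted from
the sum, as in the multifactor priority plugin.

```python
# Sample row copied from the sprio output quoted above.
ROW = "280919 batch mep1c10 1275481 400000 59827 415655 0 0 -400000"

# Illustrative field names (not sprio's own). PARTITION appears twice in
# the sprio header: first the partition name, then the partition factor.
FIELDS = ["jobid", "partition", "user", "priority", "age",
          "fairshare", "jobsize", "partition_factor", "qos", "nice"]


def parse_sprio_row(row):
    """Split one whitespace-separated sprio row into a dict."""
    values = row.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    record = dict(zip(FIELDS, values))
    # Everything from 'priority' onwards is numeric.
    for name in FIELDS[3:]:
        record[name] = int(record[name])
    return record


def factor_sum(record):
    """Add the weighted factors and subtract the nice value."""
    return (record["age"] + record["fairshare"] + record["jobsize"]
            + record["partition_factor"] + record["qos"] - record["nice"])


job = parse_sprio_row(ROW)
total = factor_sum(job)
print(job["priority"], total)
```

For this job the sum comes to 1275482 against a reported priority of
1275481. The difference of 1 is presumably due to rounding of the
individual factors. So the negative nice value is indeed contributing
the full 400000 to the priority.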
[snip (242 lines)]
--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.bennett at fu-berlin.de