[slurm-users] Longer queuing times for larger jobs

Thu Feb 13 06:45:52 UTC 2020

Loris Bennett <loris.bennett at fu-berlin.de> writes:

> Hello David,
>
> David Baker <D.J.Baker at soton.ac.uk> writes:
>
>> Hello,
>>
>> I've taken a very good look at our cluster, but have not yet made
>> any significant changes. The one change that I did make was to
>> increase the "jobsizeweight". That's now our dominant parameter and it
>> does ensure that our largest jobs (> 20 nodes) are making it to the
>> top of the sprio listing which is what we want to see.
>>
>> These large jobs aren't making any progress despite the priority
>> lift. I additionally decreased the nice value of the job that sparked
>> this discussion. That is (looking at sprio) there is a 32 node job
>> with a very high priority...
>>
>>  JOBID PARTITION     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS       NICE
>> 280919 batch      mep1c10    1275481     400000      59827     415655          0          0    -400000
>>
>> That job has been sitting in the queue for well over a week and it is
>> disconcerting that we never see nodes becoming idle in order to
>> service these large jobs. Nodes do become idle and then get scooped by
>> jobs started by backfill. Looking at the slurmctld logs I see that the
>> vast majority of jobs are being started via backfill -- including, for
>> example, a 24 node job. I see very few jobs allocated by the
>> scheduler. That is, messages like sched: Allocate JobId=6915 are few
>> and far between and I never see any of the large jobs being allocated
>> in the batch queue.
>>
>> Surely this is not correct. Does anyone have any advice on what to
>> check, please?
>
> Have you looked at what 'sprio' says?  I usually want to see the list
> sorted by priority and so call it like this:
>
>   sprio -l -S "%Y"

This should be

  sprio -l -S "Y"
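
As an aside, the PRIORITY value in the quoted sprio line can be checked
by hand. With the multifactor priority plugin, the job priority is the
sum of the weighted factors minus the nice value. Here is a minimal
Python sketch of that arithmetic, using the numbers copied from the
listing above. The variable names and dictionary keys are chosen only
for illustration, and the difference of one between the sum and the
reported value is presumably down to rounding of the individual factors:

```python
# Values copied from the sprio line for job 280919 quoted above.
reported_priority = 1275481
factors = {
    "age": 400000,
    "fairshare": 59827,
    "jobsize": 415655,
    "partition": 0,
    "qos": 0,
}
nice = -400000

# Sum of the weighted factors minus the nice value; a negative nice
# value therefore raises the priority.
total = sum(factors.values()) - nice

print("sum of components:", total)
print("reported - sum:   ", reported_priority - total)
```

So the job's priority is made up almost entirely of the age factor, the
job size factor and the negative nice value, in roughly equal parts.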

[snip (242 lines)]

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de