[slurm-users] Larger jobs tend to get starved out on our cluster

Loris Bennett loris.bennett at fu-berlin.de
Wed Jan 9 23:48:58 MST 2019


Hi David,

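If your maximum run-time is longer than the 2 1/2 days (3600 minutes) you
have set for bf_window, you might need to increase bf_window
accordingly, since the backfill scheduler only looks that many minutes
into the future when planning job starts. See the description here: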

https://slurm.schedmd.com/sched_config.html
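
As a rough sketch only, assuming the longest time limit on the cluster
is 7 days (an assumption; the actual limit is not given in the post),
the change might look like the line below. The Slurm documentation
suggests also raising bf_resolution when bf_window is increased; the
value shown here is purely illustrative.

```
# Sketch, not a recommendation: assumes a 7-day maximum time limit
# (7 * 24 * 60 = 10080 minutes). bf_resolution=600 is an illustrative value.
SchedulerParameters=bf_window=10080,bf_resolution=600,bf_max_job_user=4
```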

Cheers,

Loris

Baker D.J. <D.J.Baker at soton.ac.uk> writes:

> Hello,
>
> A colleague intimated that he thought that larger jobs were tending to
> get starved out on our Slurm cluster. It's not a busy time at the
> moment so it's difficult to test this properly. Back in November it
> was not completely unusual for a larger job to have to wait up to a
> week to start.
>
> I've extracted the key scheduling configuration out of the slurm.conf
> and I would appreciate your comments, please. Even at the busiest of
> times we notice many single compute jobs executing on the cluster --
> starting either via the scheduler or by backfill.
>
> Looking at the scheduling configuration do you think that I'm
> favouring small jobs too much? That is, for example, should I increase
> the PriorityWeightJobSize to encourage larger jobs to run?
>
> I was very keen not to starve out small/medium jobs, however perhaps
> there is too much emphasis on small/medium jobs in our setup.
>
> My colleague is from a Moab background, and in that respect he was
> surprised not to see nodes being reserved for jobs, but it could be
> that Slurm works in a different way to try to make efficient use of
> the cluster by backfilling more aggressively than Moab. Certainly we
> see a great deal of activity from backfill.
>
> In this respect, does anyone understand the mechanism used to reserve
> nodes/resources for jobs in Slurm, or know where to look for
> that type of information?
>
> Best regards,
> David
>
> SchedulerType=sched/backfill
> SchedulerParameters=bf_window=3600,bf_resolution=180,bf_max_job_user=4
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core
> FastSchedule=1
> PriorityFavorSmall=NO
> PriorityFlags=DEPTH_OBLIVIOUS,SMALL_RELATIVE_TO_TIME,FAIR_TREE
> PriorityType=priority/multifactor
> PriorityDecayHalfLife=14-0
>
> PriorityWeightFairshare=1000000
> PriorityWeightAge=100000
> PriorityWeightPartition=0
> PriorityWeightJobSize=100000
> PriorityWeightQOS=10000
> PriorityMaxAge=7-0
>
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


