[slurm-users] Excessive use of backfill on a cluster

Tue Nov 20 08:32:36 MST 2018

Hello,

Thank you for your reply and for the explanation. That makes sense -- your explanation of backfill is as we expected. What surprises us is that almost all of our jobs are being scheduled using backfill; we very rarely see any being scheduled normally. It could be that we haven't tuned our priority weights particularly well. We need a setup that allows users to run everything from small jobs (including very small, short-duration test jobs with a high QOS) to large jobs, over a range of run times, without too many users losing out. Initially our Age and Job size scaling factors were too low; our current setup is shown below.

Any thoughts, please?

Best regards,

David

PriorityParameters      = (null)
PriorityDecayHalfLife   = 14-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = SMALL_RELATIVE_TO_TIME,DEPTH_OBLIVIOUS
PriorityMaxAge          = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 100000
PriorityWeightFairShare = 100000
PriorityWeightJobSize   = 100000
PriorityWeightPartition = 0
PriorityWeightQOS       = 1000000
PriorityWeightTRES      = (null)
PropagatePrioProcess    = 0
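
For reference, a rough sketch of how these weights combine under priority/multifactor: the job priority is essentially a weighted sum of factors, each normalised to the range 0.0 to 1.0. The weights are the ones listed above; the factor values and job names are hypothetical, chosen only to show how much the QOS weight dominates, and the sketch ignores the TRES, site and nice terms.

```python
# Weights copied from the configuration listed above.
WEIGHTS = {
    "age": 100000,
    "fairshare": 100000,
    "jobsize": 100000,
    "partition": 0,
    "qos": 1000000,
}


def job_priority(factors):
    """Weighted sum of normalised factors (each expected in 0.0..1.0).

    Simplified: the TRES, site and nice terms are left out.
    """
    return round(sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS))


# Hypothetical factor values, for illustration only.
test_job = {"age": 0.0, "fairshare": 0.5, "jobsize": 0.0, "qos": 1.0}
old_big_job = {"age": 1.0, "fairshare": 0.5, "jobsize": 1.0, "qos": 0.0}

print("high-QOS test job:", job_priority(test_job))
print("old large job:    ", job_priority(old_big_job))
```

With these assumed factors, a brand-new test job on the top QOS still outranks a large job that has reached PriorityMaxAge, because the QOS weight is ten times each of the others.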

________________________________
From: Loris Bennett <loris.bennett at fu-berlin.de>
Sent: 20 November 2018 13:26:14
To: Baker D.J.
Cc: Slurm User Community List
Subject: Re: [slurm-users] Excessive use of backfill on a cluster

Hi David,

Baker D.J. <D.J.Baker at soton.ac.uk> writes:

> Hello,
>
> We are running Slurm 18.08.0 on our cluster and I am concerned that
> Slurm appears to be using backfill scheduling excessively. In fact the
> vast majority of jobs are being scheduled using backfill. So, for
> example, I have just submitted a set of three serial jobs. They all
> started on a compute node that was completely free, but
> disconcertingly in the slurmctld log they were all reported as started
> using backfill and that isn't making sense...
>
> [2018-11-20T12:31:27.598] backfill: Started JobId=217031 in batch on red158
> [2018-11-20T12:32:28.004] backfill: Started JobId=217032 in batch on red158
> [2018-11-20T12:33:58.608] backfill: Started JobId=217033 in batch on red158
>
> I either don't understand the context of backfill re slurm or the
> above is odd. Has anyone seen this "overuse" (unnecessary use) of
> backfill on their cluster, and/or could anyone offer advice, please?

I am not sure what "excessive backfilling" might mean.  If you have
a job which requires a large amount of resources to become available
before it can start, then backfilling will allow other jobs with a lower
priority to be run, if this can be achieved without delaying the start
of the large job.  So if a job needs 100 nodes, at some point 99 of them
will be idle.  Jobs which can start and finish before the 100th node
becomes available will indeed be backfilled on empty nodes.  This is how
backfilling is supposed to work.
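
The rule described above can be sketched roughly as follows. This is a simplified illustration, not Slurm's actual backfill code: the Job class, the can_backfill function and all the numbers are hypothetical, and the real scheduler works on per-node reservations and many more constraints.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    time_limit: int  # requested wall time, in minutes


def can_backfill(candidate, idle_nodes, reserved_start, now=0):
    """Return True if the candidate fits on the idle nodes and would
    finish before the higher-priority job's expected start time."""
    fits = candidate.nodes <= idle_nodes
    ends_in_time = now + candidate.time_limit <= reserved_start
    return fits and ends_in_time


# Hypothetical situation: the large job needs 100 nodes, 99 are idle,
# and the 100th is expected to become free in 120 minutes.
idle_nodes = 99
reserved_start = 120

short_job = Job("short", nodes=1, time_limit=60)
long_job = Job("long", nodes=1, time_limit=240)

for job in (short_job, long_job):
    print(job.name, can_backfill(job, idle_nodes, reserved_start))
```

Under these assumptions the 60-minute job is backfilled onto an idle node, while the 240-minute job is not, since it would delay the start of the large job.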

Or am I misunderstanding your problem?

Cheers,

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de