[slurm-users] Large job starvation on cloud cluster

Michael Gutteridge michael.gutteridge at gmail.com
Wed Feb 27 21:41:52 UTC 2019


> You have not provided enough information (cluster configuration, job
> information, etc) to diagnose what accounting policy is being violated.

Yeah, sorry.  I'm trying to balance the amount of information and likely
erred on the side of being too concise 8-/

The partition looks like:

PartitionName=largenode
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=lg_node_default
   DefaultTime=3-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=7-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=18
   Nodes=lg_nodeg[0-103],lg_nodeh[0-34]
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=YES:4
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2432 TotalNodes=139 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=245859

The partition QOS (lg_node_default) had no limits configured.  As indicated,
I've since added a "MaxTRESPU" limit to cap per-user CPU utilisation and get
jobs running again.
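
For anyone wanting to do the same, the change was roughly along these lines
(the cpu value here is illustrative rather than our actual number):

    sacctmgr modify qos lg_node_default set MaxTRESPerUser=cpu=100
    sacctmgr show qos lg_node_default format=Name,MaxTRESPU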

The nodes in this partition are large enough to satisfy both the larger and
smaller jobs:

NodeName=lg_nodeg1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=18 CPUTot=18 CPULoad=3.00
   AvailableFeatures=c5.9xlarge
   ActiveFeatures=c5.9xlarge
   Gres=(null)
   NodeAddr=lg_nodeg1.fhcrc.org NodeHostName=lg_nodeg1 Port=0 Version=18.08
   OS=Linux 4.4.0-141-generic #167~14.04.1-Ubuntu SMP Mon Dec 10 13:20:24 UTC 2018
   RealMemory=70348 AllocMem=0 FreeMem=25134 Sockets=18 Boards=1
   State=ALLOCATED+CLOUD ThreadsPerCore=1 TmpDisk=7924 Weight=40 Owner=N/A MCS_label=N/A
   Partitions=largenode
   BootTime=2019-02-20T07:58:22 SlurmdStartTime=2019-02-20T07:58:38
   CfgTRES=cpu=18,mem=70348M,billing=18
   AllocTRES=cpu=18
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=lg_nodeh1 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=12 CPUTot=16 CPULoad=2.02
   AvailableFeatures=r4.8xlarge
   ActiveFeatures=r4.8xlarge
   Gres=(null)
   NodeAddr=lg_nodeh1.fhcrc.org NodeHostName=lg_nodeh1 Port=0 Version=18.08
   OS=Linux 4.4.0-141-generic #167~14.04.1-Ubuntu SMP Mon Dec 10 13:20:24 UTC 2018
   RealMemory=245853 AllocMem=0 FreeMem=147943 Sockets=16 Boards=1
   State=MIXED+CLOUD ThreadsPerCore=1 TmpDisk=7924 Weight=80 Owner=N/A MCS_label=N/A
   Partitions=largenode
   BootTime=2019-02-27T01:35:35 SlurmdStartTime=2019-02-27T01:35:47
   CfgTRES=cpu=16,mem=245853M,billing=16
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

The limit that is in play is on the account:

   Account GrpJobs GrpNodes  GrpCPUs  GrpMem GrpSubmit
---------- ------- -------- -------- ------- ---------
  account1                       300
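
That's trimmed sacctmgr output.  Something along these lines should show and
set the same limit - these are the TRES-era spellings, so adjust for your
Slurm version:

    sacctmgr show assoc where account=account1 \
        format=Account,GrpJobs,GrpTRES,GrpSubmit
    sacctmgr modify account account1 set GrpTRES=cpu=300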

Some possibly relevant slurm.conf parameters:

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=limits,qos
FastSchedule=0
SchedulerType=sched/backfill
SchedulerParameters=bf_resolution=360,defer,bf_continue,bf_max_job_user=10,bf_window=10080
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
PreemptMode=OFF
PreemptType=preempt/none
PriorityType=priority/multifactor
PriorityDecayHalfLife=1-00:00:00
PriorityMaxAge=1-00:00:00
PriorityWeightAge=10
PriorityWeightFairshare=100000000
PriorityWeightQOS=1000000

and finally the power-management:

SuspendProgram=/var/lib/slurm-llnl/suspend
SuspendTime=300
SuspendRate=10
ResumeProgram=/var/lib/slurm-llnl/resume
ResumeRate=10
ResumeTimeout=300

There's no logic in the suspend/resume scripts; they simply start and stop
nodes according to what slurmctld says.  I don't know exactly what logic
the controller uses to start or stop nodes, but I do know it isn't
attempting to start nodes to satisfy the larger job.
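
For context, a minimal sketch of what the resume side looks like - here
"start_instance" is just a placeholder for the provider's actual CLI call,
and the suspend script is the same shape with a stop call:

    #!/bin/bash
    # slurmctld passes a hostlist expression, e.g. "lg_nodeg[2-4]";
    # expand it and start each instance by name
    for host in $(scontrol show hostnames "$1"); do
        start_instance "$host"
    done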

> JobId=2210784 delayed for accounting policy is likely the key as it
> indicates the job is currently unable to run, so the lower priority smaller
> job bumps ahead of it.

Yeah, that's exactly what I think is happening.  In the on-prem cluster the
backfill scheduler creates a priority reservation for the higher-priority
job, which keeps such jobs from being starved.  However, out in the cloud
cluster that doesn't seem to happen.
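
One way to check whether the backfill scheduler is actually planning a
future start for the big job is roughly this - the log path is a guess
based on the slurm-llnl paths above, so adjust for your install:

    # turn on backfill debug output at runtime
    scontrol setdebugflags +Backfill

    # then watch the controller log for mentions of the large job
    grep 2210784 /var/log/slurm-llnl/slurmctld.log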

Thanks for looking at the problem

 - Michael

On Wed, Feb 27, 2019 at 12:54 PM Thomas M. Payerle <payerle at umd.edu> wrote:

> The "JobId=2210784 delayed for accounting policy is likely the key as it
> indicates the job is currently unable to run, so the lower priority smaller
> job bumps ahead of it.
> You have not provided enough information (cluster configuration, job
> information, etc) to diagnose what accounting policy is being violated.
> Like you, I suspect that this is happening due to power management and
> powered-down nodes (I am not experienced with sending jobs to the cloud)
> --- what is the policy for starting the powered down nodes?  I can also see
> issues due to the delay in starting the powered down nodes; the scheduler
> starts looking at 2210784, finds there are not enough idle, running nodes
> to launch it, and maybe tells some nodes to spin up, but by the time they
> spin up it has already assigned the previously up and idle nodes to the
> smaller job.
>
> On Wed, Feb 27, 2019 at 3:33 PM Michael Gutteridge <
> michael.gutteridge at gmail.com> wrote:
>
>> I've run into a problem with a cluster we've got at a cloud provider -
>> hoping someone might have some advice.
>>
>> The problem is that I've got a circumstance where large jobs _never_
>> start... or more correctly, that large-er jobs don't start when there are
>> many smaller jobs in the partition.  In this cluster, accounts are limited
>> to 300 cores.  One user has submitted a couple thousand jobs that each use
>> 6 cores.  These queue up, start nodes, and eventually all 300 cores in the
>> limit are busy and the remaining jobs are held with "AssocGrpCpuLimit".
>> All as expected.
>>
>> Then another user submits a job requesting 16 cores.  This one, too, gets
>> held with the same reason.  However, that larger job never starts even if
>> it has the highest priority of jobs in this account (I've set it manually
>> and by using nice).
>>
>> What I see in the sched.log is:
>>
>> sched: [2019-02-25T16:00:14.940] Running job scheduler
>> sched: [2019-02-25T16:00:14.941] JobId=2210784 delayed for accounting
>> policy
>> sched: [2019-02-25T16:00:14.942] JobId=2203130 initiated
>> sched: [2019-02-25T16:00:14.942] Allocate JobId=2203130 NodeList=node1
>> #CPUs=6 Partition=largenode
>>
>> In this case, 2210784 is the job requesting 16 cores and 2203130 is one
>> of the six core jobs.  This seems to happen with either the backfill or
>> builtin scheduler.  I suspect what's happening is that when one of the
>> smaller jobs completes, the scheduler first looks at the higher-priority
>> large job, determines that it cannot run because of the constraint, looks
>> at the next job in the list, determines that it can run without exceeding
>> the limit, and then starts that job.  In this way, the larger job isn't
>> started until all of these smaller jobs complete.
>>
>> I thought that switching to the builtin scheduler would fix this, but as
>> slurm.conf(5) indicates:
>>
>> > An exception is made for jobs that can not run due
>> > to partition constraints (e.g. the time limit) or
>> > down/drained nodes.  In that case, lower priority
>> > jobs can be initiated and not impact the higher
>> > priority job.
>>
>> I suspect one of these exceptions is being triggered - the limit is in the
>> job's association, so I don't think it's a partition constraint.  I don't
>> have this problem with the on-premises cluster, so I suspect it's something
>> to do with power management and the state of powered-down nodes.
>>
>> I've sort-of worked around this by setting a per-user limit lower than
>> the per-account limit, but that doesn't address the situation where a
>> single user submits both large and small jobs, and it does lead to some
>> other problems for other groups, so it's not a long-term solution.
>>
>> Thanks for having a look
>>
>>  - Michael
>>
>>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads        payerle at umd.edu
> 5825 University Research Park               (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
>