[slurm-users] CPU allocation for the GPU jobs.
navin srivastava
navin.altair at gmail.com
Mon Jul 13 12:09:46 UTC 2020
Thanks Renfro. My scheduling policy is below.
SchedulerType=sched/builtin
SelectType=select/cons_res
SelectTypeParameters=CR_Core
AccountingStorageEnforce=associations
AccountingStorageHost=192.168.150.223
AccountingStorageType=accounting_storage/slurmdbd
ClusterName=hpc
JobCompType=jobcomp/slurmdbd
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=5
SlurmdDebug=5
Waittime=0
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
MaxJobCount=5000000
SchedulerParameters=enable_user_top,default_queue_depth=1000000
# JOB PRIORITY
PriorityType=priority/multifactor
PriorityDecayHalfLife=2
PriorityUsageResetPeriod=DAILY
PriorityWeightFairshare=500000
PriorityFlags=FAIR_TREE
Let me try changing it to sched/backfill and see if it helps.
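Roughly, the change I have in mind looks like this (a sketch only; the bf_*
values are illustrative, not tuned for our site):

SchedulerType=sched/backfill
SchedulerParameters=enable_user_top,default_queue_depth=1000000,bf_continue,bf_window=1440,bf_max_job_test=1000

I will also need to make sure jobs carry a wall time (or add DefaultTime= on
the partitions) so backfill has an end-time estimate to work with.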
Regards
Navin.
On Mon, Jul 13, 2020 at 5:16 PM Renfro, Michael <Renfro at tntech.edu> wrote:
> “The *SchedulerType* configuration parameter specifies the scheduler
> plugin to use. Options are sched/backfill, which performs backfill
> scheduling, and sched/builtin, which attempts to schedule jobs in a strict
> priority order within each partition/queue.”
>
> https://slurm.schedmd.com/sched_config.html
>
> If you’re using the builtin scheduler, lower priority jobs have no way to
> run ahead of higher priority jobs. If you’re using the backfill scheduler,
> your jobs will need wall times specified, since the idea with
> backfill is to run lower priority jobs ahead of time if and only if they
> can complete without delaying the estimated start time of higher priority
> jobs.
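>
> For example (illustrative values only, adjust for your site):
>
> sbatch --time=02:00:00 --cpus-per-task=4 job.sh
>
> or give the partition a DefaultTime in slurm.conf so jobs that omit
> --time still get an estimated end time:
>
> PartitionName=node Nodes=node[1-10,14-16,28-33,35] Default=YES DefaultTime=04:00:00 MaxTime=INFINITE State=UP Shared=YES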
>
> On Jul 13, 2020, at 4:18 AM, navin srivastava <navin.altair at gmail.com>
> wrote:
>
> Hi Team,
>
> We have separate partitions for the GPU nodes and for the CPU-only nodes.
>
> Scenario: jobs submitted to the nodeGPUsmall and nodeGPUbig partitions
> request either 4 CPUs + 1 GPU or 4 CPUs only. When all GPUs are exhausted,
> the GPU jobs sit in the queue waiting for GPU resources, and the CPU-only
> jobs do not start even though plenty of CPU resources are available. The
> CPU-only jobs stay pending behind the GPU jobs because the GPU jobs have
> higher priority than the CPU ones.
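>
> For reference, the two job types are submitted roughly like this (script
> names are placeholders):
>
> sbatch -p nodeGPUsmall --gres=gpu:1 --cpus-per-task=4 gpu_job.sh
> sbatch -p nodeGPUsmall --cpus-per-task=4 cpu_job.sh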
>
> Is there an option so that when all GPU resources are exhausted, the
> CPU-only jobs are still allowed to run? Is there a way to deal with this,
> or some custom solution we could think of? There is no issue with the
> CPU-only partitions.
>
> Below is my Slurm configuration file:
>
>
> NodeName=node[1-12] NodeAddr=node[1-12] Sockets=2 CoresPerSocket=10
> RealMemory=128833 State=UNKNOWN
> NodeName=node[13-16] NodeAddr=node[13-16] Sockets=2 CoresPerSocket=10
> RealMemory=515954 Feature=HIGHMEM State=UNKNOWN
> NodeName=node[28-32] NodeAddr=node[28-32] Sockets=2 CoresPerSocket=28
> RealMemory=257389
> NodeName=node[32-33] NodeAddr=node[32-33] Sockets=2 CoresPerSocket=24
> RealMemory=773418
> NodeName=node[17-27] NodeAddr=node[17-27] Sockets=2 CoresPerSocket=18
> RealMemory=257687 Feature=K2200 Gres=gpu:2
> NodeName=node[34] NodeAddr=node34 Sockets=2 CoresPerSocket=24
> RealMemory=773410 Feature=RTX Gres=gpu:8
>
>
> PartitionName=node Nodes=node[1-10,14-16,28-33,35] Default=YES
> MaxTime=INFINITE State=UP Shared=YES
> PartitionName=nodeGPUsmall Nodes=node[17-27] Default=NO MaxTime=INFINITE
> State=UP Shared=YES
> PartitionName=nodeGPUbig Nodes=node[34] Default=NO MaxTime=INFINITE
> State=UP Shared=YES
>
> Regards
> Navin.
>
>
>