[slurm-users] Cannot enable Gang scheduling

Helder Daniel hdaniel at ualg.pt
Fri Jan 13 12:08:43 UTC 2023


Hi Kevin

I did a "scontrol show partition".
OverSubscribe was not enabled.
I enabled it in slurm.conf with:

(...)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP

but now it works only for CPU jobs; GPU jobs are not preempted.
Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, preemption
happens after the time slice as expected:

sbatch --cpus-per-task=32 test-cpu.sh
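
(test-cpu.sh itself is not listed here; a minimal sketch of such a CPU-burner
script, assuming it only needs to keep all allocated cores busy across several
30-second time slices, could be:)

#!/bin/bash
#SBATCH --job-name=cpu-only
# Illustrative sketch only: keep every allocated core busy long enough to
# span several gang-scheduling time slices.
for i in $(seq "${SLURM_CPUS_PER_TASK:-1}"); do
    yes > /dev/null &
done
sleep 600          # roughly 10 minutes of load
kill $(jobs -p)    # stop the background burners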

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
               353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
               351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov

But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2 that
start running are never preempted, and the 3rd job is left pending on
Resources.
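
(The submission looks something like this; --gres=gpu:2 is just one way to
ask for 2 of the 4 GPUs, and test-gpu.sh stands in for the actual script:)

sbatch --gres=gpu:2 test-gpu.sh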

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
               354  asimov01      gpu  hdaniel  R       3:05      1 asimov
               355  asimov01      gpu  hdaniel  R       3:02      1 asimov

Do I need to change anything else in the configuration to also support GPU
gang scheduling?
Thanks

============================================================================
scontrol show partition asimov01
PartitionName=asimov01
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=asimov
   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
   JobDefaults=DefCpuPerGPU=2
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:

> The problem might be that OverSubscribe is not enabled? Without it, I don't
> believe the time-slicing can be GANG scheduled.
>
> Can you do a "scontrol show partition" to verify that it is?
>
> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>
>> Hi,
>>
>> I am trying to enable gang scheduling on a server with a 32-core CPU and 4
>> GPUs.
>>
>> However, with gang scheduling, the CPU jobs (or GPU jobs) are not being
>> preempted after the time slice, which is set to 30 seconds.
>>
>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>> The first 2 jobs launched are never preempted. The 3rd job starves forever
>> (or at least until one of the other 2 ends):
>>
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>                311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>                312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>
>> The same happens with GPU jobs. If I launch 5 jobs, each requiring one
>> GPU, the 5th job will never run. Preemption is not happening at the
>> specified time slice.
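>>
>> (Each of those jobs is submitted with something like the line below;
>> --gres=gpu:1 and the script name are only illustrative:)
>>
>> for i in $(seq 5); do sbatch --gres=gpu:1 test-gpu.sh; done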
>>
>> I tried several combinations:
>>
>> SchedulerType=sched/builtin  and backfill
>> SelectType=select/cons_tres   and linear
>>
>> I'd appreciate any help and suggestions.
>> The slurm.conf is below.
>> Thanks
>>
>> ClusterName=asimov
>> SlurmctldHost=localhost
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>> ReturnToService=2
>> SlurmctldPidFile=/var/run/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>> SlurmUser=slurm
>> StateSaveLocation=/var/lib/slurm/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none # task/cgroup
>> #
>> # TIMERS
>> InactiveLimit=0
>> KillWait=30
>> MinJobAge=300
>> SlurmctldTimeout=120
>> SlurmdTimeout=300
>> Waittime=0
>> #
>> # SCHEDULING
>> #FastSchedule=1 #obsolete
>> SchedulerType=sched/builtin #backfill
>> SelectType=select/cons_tres
>> SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
>> PreemptType = preempt/partition_prio
>> PreemptMode = SUSPEND,GANG
>> SchedulerTimeSlice=30           #in seconds, default 30
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStoragePort=
>> AccountingStorageType=accounting_storage/none
>> #AccountingStorageEnforce=associations
>> #ClusterName=bip-cluster
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/linux
>> SlurmctldDebug=info
>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>> SlurmdDebug=info
>> SlurmdLogFile=/var/log/slurm/slurmd.log
>> #
>> #
>> # COMPUTE NODES
>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>
>> # Partitions
>> GresTypes=gpu
>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>
>>

-- 
with best regards,

Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel