[slurm-users] Cannot enable Gang scheduling
Kevin Broch
kbroch at rivosinc.com
Fri Jan 13 12:29:46 UTC 2023
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't
managed by Slurm, so the idea of suspending one job's GPU memory so that
another job can use the rest simply isn't possible.
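
If you need GPU jobs to give way to other work, requeue-based preemption
(rather than SUSPEND) might be worth a try, since a requeued job releases
its GPUs completely. A rough, untested sketch of what that could look like
in slurm.conf (the partition names here are just placeholders):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=gpu-low  Nodes=asimov PriorityTier=1 State=UP
PartitionName=gpu-high Nodes=asimov PriorityTier=2 State=UP

Jobs submitted to the higher PriorityTier partition would then preempt and
requeue jobs in the lower one instead of time-slicing them (preempted batch
jobs are requeueable by default).
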
On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdaniel at ualg.pt> wrote:
> Hi Kevin
>
> I did a "scontrol show partition".
> OverSubscribe was not enabled.
> I enabled it in slurm.conf with:
>
> (...)
> GresTypes=gpu
> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>
> but now it only works for CPU jobs; it does not preempt GPU jobs.
> Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, preemption
> happens after the timeslice as expected:
>
> sbatch --cpus-per-task=32 test-cpu.sh
>
> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>   353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>   351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
>
> But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, it does not
> preempt the first 2 that start running; the 3rd job stays pending on
> resources.
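> (such a job would be submitted along the lines of
> sbatch --gres=gpu:2 test-gpu.sh
> with test-gpu.sh standing in for the actual GPU job script)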
>
> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
>   354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>   355  asimov01      gpu  hdaniel  R       3:02      1 asimov
>
> Do I need to change anything else in the configuration to also support GPU
> gang scheduling?
> Thanks
>
>
> ============================================================================
> scontrol show partition asimov01
> PartitionName=asimov01
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=YES QoS=N/A
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
> MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
> Nodes=asimov
> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
> State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
> JobDefaults=DefCpuPerGPU=2
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>
>> The problem might be that OverSubscribe is not enabled? Without it, I don't
>> believe the time-slicing can be gang scheduled.
>>
>> Can you do a "scontrol show partition" to verify that it is?
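>>
>> For instance, to check just that field:
>>
>> scontrol show partition asimov01 | grep -i oversubscribe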
>>
>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>
>>> Hi,
>>>
>>> I am trying to enable gang scheduling on a server with a CPU with 32
>>> cores and 4 GPUs.
>>>
>>> However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not
>>> being preempted after the time slice, which is set to 30 seconds.
>>>
>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>> The first 2 jobs launched are never preempted, and the 3rd job starves
>>> forever (or at least until one of the other 2 ends):
>>>
>>> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>   313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>>   311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>>   312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>>
>>> The same happens with GPU jobs. If I launch 5 jobs, requiring one GPU
>>> each, the 5th job will never run. The preemption is not working with the
>>> specified timeslice.
>>>
>>> I tried several combinations:
>>>
>>> SchedulerType=sched/builtin and backfill
>>> SelectType=select/cons_tres and linear
>>>
>>> I'd appreciate any help or suggestions.
>>> The slurm.conf is below.
>>> Thanks
>>>
>>> ClusterName=asimov
>>> SlurmctldHost=localhost
>>> MpiDefault=none
>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>> ReturnToService=2
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>> SlurmUser=slurm
>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/none # task/cgroup
>>> #
>>> # TIMERS
>>> InactiveLimit=0
>>> KillWait=30
>>> MinJobAge=300
>>> SlurmctldTimeout=120
>>> SlurmdTimeout=300
>>> Waittime=0
>>> #
>>> # SCHEDULING
>>> #FastSchedule=1 #obsolete
>>> SchedulerType=sched/builtin #backfill
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core # CR_Core_Memory lets only one job run at a time
>>> PreemptType = preempt/partition_prio
>>> PreemptMode = SUSPEND,GANG
>>> SchedulerTimeSlice=30 #in seconds, default 30
>>> #
>>> # LOGGING AND ACCOUNTING
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/none
>>> #AccountingStorageEnforce=associations
>>> #ClusterName=bip-cluster
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> SlurmctldDebug=info
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdDebug=info
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>> #
>>> #
>>> # COMPUTE NODES
>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>> # Partitions
>>> GresTypes=gpu
>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>
>>>
>
> --
> with best regards,
>
> Helder Daniel
> Universidade do Algarve
> Faculdade de Ciências e Tecnologia
> Departamento de Engenharia Electrónica e Informática
> https://www.ualg.pt/pt/users/hdaniel
>