[slurm-users] Cannot enable Gang scheduling
Kevin Broch
kbroch at rivosinc.com
Fri Jan 13 12:29:46 UTC 2023
My guess is that this isn't possible with GANG,SUSPEND. GPU memory isn't
managed by Slurm, so the idea of suspending one job's GPU memory so that
another job can use the rest simply isn't possible.
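
If you need GPU jobs to give way to other work, requeue-based preemption
(rather than SUSPEND) might be worth a try, since a requeued job releases
its GPUs completely. A rough, untested sketch of what that could look like
in slurm.conf (the partition names here are just placeholders):

PreemptType=preempt/partition_prio
PreemptMode=REQUEUE
PartitionName=gpu-low  Nodes=asimov PriorityTier=1 State=UP
PartitionName=gpu-high Nodes=asimov PriorityTier=2 State=UP

Jobs submitted to the higher PriorityTier partition would then preempt and
requeue jobs in the lower one instead of time-slicing them (preempted batch
jobs are requeueable by default).
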
On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdaniel at ualg.pt> wrote:
> Hi Kevin
>
> I did a "scontrol show partition".
> OverSubscribe was not enabled.
> I enabled it in slurm.conf with:
>
> (...)
> GresTypes=gpu
> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>
> but now it only works for CPU jobs; it does not preempt GPU jobs.
> Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, preemption
> happens after the timeslice as expected:
>
> sbatch --cpus-per-task=32 test-cpu.sh
>
> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>   353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>   351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
>
> But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, it does not
> preempt the first 2 that start running; the 3rd job stays pending on
> resources.
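> (such a job would be submitted along the lines of
> sbatch --gres=gpu:2 test-gpu.sh
> with test-gpu.sh standing in for the actual GPU job script)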
>
> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>   356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
>   354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>   355  asimov01      gpu  hdaniel  R       3:02      1 asimov
>
> Do I need to change anything else in the configuration to also support GPU
> gang scheduling?
> Thanks
>
>
> ============================================================================
> scontrol show partition asimov01
> PartitionName=asimov01
> AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
> AllocNodes=ALL Default=YES QoS=N/A
> DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
> MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
> Nodes=asimov
> PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
> OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
> State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
> JobDefaults=DefCpuPerGPU=2
> DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>
>> The problem might be that OverSubscribe is not enabled? Without it, I don't
>> believe the time-slicing can be gang scheduled.
>>
>> Can you do a "scontrol show partition" to verify that it is?
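>>
>> For instance, to check just that field:
>>
>> scontrol show partition asimov01 | grep -i oversubscribe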
>>
>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>
>>> Hi,
>>>
>>> I am trying to enable gang scheduling on a server with a CPU with 32
>>> cores and 4 GPUs.
>>>
>>> However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not
>>> being preempted after the time slice, which is set to 30 seconds.
>>>
>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>> The first 2 jobs launched are never preempted, and the 3rd job starves
>>> forever (or at least until one of the other 2 ends):
>>>
>>> JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>   313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>>   311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>>   312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>>
>>> The same happens with GPU jobs. If I launch 5 jobs, requiring one GPU
>>> each, the 5th job will never run. The preemption is not working with the
>>> specified timeslice.
>>>
>>> I tried several combinations:
>>>
>>> SchedulerType=sched/builtin and backfill
>>> SelectType=select/cons_tres and linear
>>>
>>> I'd appreciate any help or suggestions.
>>> The slurm.conf is below.
>>> Thanks
>>>
>>> ClusterName=asimov
>>> SlurmctldHost=localhost
>>> MpiDefault=none
>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>> ReturnToService=2
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>> SlurmUser=slurm
>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/none # task/cgroup
>>> #
>>> # TIMERS
>>> InactiveLimit=0
>>> KillWait=30
>>> MinJobAge=300
>>> SlurmctldTimeout=120
>>> SlurmdTimeout=300
>>> Waittime=0
>>> #
>>> # SCHEDULING
>>> #FastSchedule=1 #obsolete
>>> SchedulerType=sched/builtin #backfill
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core # CR_Core_Memory lets only one job run at a time
>>> PreemptType = preempt/partition_prio
>>> PreemptMode = SUSPEND,GANG
>>> SchedulerTimeSlice=30 #in seconds, default 30
>>> #
>>> # LOGGING AND ACCOUNTING
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/none
>>> #AccountingStorageEnforce=associations
>>> #ClusterName=bip-cluster
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> SlurmctldDebug=info
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdDebug=info
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>> #
>>> #
>>> # COMPUTE NODES
>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>> # Partitions
>>> GresTypes=gpu
>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>
>>>
>
> --
> with best regards,
>
> Helder Daniel
> Universidade do Algarve
> Faculdade de Ciências e Tecnologia
> Departamento de Engenharia Electrónica e Informática
> https://www.ualg.pt/pt/users/hdaniel
>