[slurm-users] Cannot enable Gang scheduling
Helder Daniel
hdaniel at ualg.pt
Fri Jan 13 13:19:22 UTC 2023
Thanks for all your help, Kevin.
I really did miss the OverSubscribe option in the docs :-(
But now CPU job scheduling is working, and I have a clearer picture of the
GPU job scheduling problem to dig into further :-)
On Fri, 13 Jan 2023 at 13:01, Kevin Broch <kbroch at rivosinc.com> wrote:
> Sorry to hear that. Hopefully others in the group have some
> ideas/explanations. I haven't had to deal with GPU resources in Slurm.
>
> On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel <hdaniel at ualg.pt> wrote:
>
>> Oh, OK.
>> I was expecting that a suspended GPU job would have its GPU memory copied
>> out to system RAM.
>>
>> I also tried REQUEUE,GANG and CANCEL,GANG.
>>
>> None of these options seems able to preempt GPU jobs.
>>
>> On Fri, 13 Jan 2023 at 12:30, Kevin Broch <kbroch at rivosinc.com> wrote:
>>
>>> My guess is that this isn't possible with GANG,SUSPEND. GPU memory
>>> isn't managed by Slurm, so suspending one job's GPU memory so that
>>> another job can use the device simply isn't possible.
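>>>
>>> A quick way to see this outside of Slurm entirely (just a sketch; <pid>
>>> is whatever your CUDA process reports):
>>>
>>>   kill -STOP <pid>   # suspend the process, much as GANG,SUSPEND would
>>>   nvidia-smi         # the stopped process still holds its device memory
>>>   kill -CONT <pid>   # resume it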
>>>
>>> On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdaniel at ualg.pt> wrote:
>>>
>>>> Hi Kevin
>>>>
>>>> I did a "scontrol show partition".
>>>> OverSubscribe was not enabled.
>>>> I enabled it in slurm.conf with:
>>>>
>>>> (...)
>>>> GresTypes=gpu
>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>> PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>
>>>> But now it is working only with CPU jobs; it does not preempt GPU jobs.
>>>> Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, they are
>>>> preempted after the timeslice as expected:
>>>>
>>>> sbatch --cpus-per-task=32 test-cpu.sh
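>>>>
>>>> (test-cpu.sh is just a CPU burner along these lines -- a reconstruction,
>>>> not the original script:)
>>>>
>>>>   #!/bin/bash
>>>>   #SBATCH --job-name=cpu-only
>>>>   # keep every allocated core busy for 10 minutes
>>>>   for i in $(seq "$SLURM_CPUS_PER_TASK"); do
>>>>     timeout 600 sh -c 'while :; do :; done' &
>>>>   done
>>>>   wait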
>>>>
>>>>   JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>     352  asimov01 cpu-only hdaniel  R  0:58     1 asimov
>>>>     353  asimov01 cpu-only hdaniel  R  0:25     1 asimov
>>>>     351  asimov01 cpu-only hdaniel  S  0:36     1 asimov
>>>>
>>>> But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2
>>>> that start running are never preempted.
>>>> The 3rd job just stays pending on resources.
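>>>>
>>>> (Each was submitted with something like the following -- again a
>>>> reconstruction, the original command line wasn't kept:)
>>>>
>>>>   sbatch --gres=gpu:2 test-gpu.sh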
>>>>
>>>>   JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>     356  asimov01      gpu hdaniel PD  0:00     1 (Resources)
>>>>     354  asimov01      gpu hdaniel  R  3:05     1 asimov
>>>>     355  asimov01      gpu hdaniel  R  3:02     1 asimov
>>>>
>>>> Do I need to change anything else in the configuration to also support
>>>> GPU gang scheduling?
>>>> Thanks
>>>>
>>>>
>>>> ============================================================================
>>>> scontrol show partition asimov01
>>>> PartitionName=asimov01
>>>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>>>    AllocNodes=ALL Default=YES QoS=N/A
>>>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>>>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>>>    Nodes=asimov
>>>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>>>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>>>>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>>>>    JobDefaults=DefCpuPerGPU=2
>>>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>>>
>>>> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>>>>
>>>>> The problem might be that OverSubscribe is not enabled? Without it, I
>>>>> don't believe time-slicing can be gang-scheduled.
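>>>>>
>>>>> For example, on the partition line (just a sketch of the relevant knob;
>>>>> pick your own setting):
>>>>>
>>>>>   PartitionName=asimov01 Nodes=asimov OverSubscribe=FORCE:4 ...
>>>>>
>>>>> With FORCE:<N>, the number also caps how many jobs may gang-share each
>>>>> resource.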
>>>>>
>>>>> Can you do a "scontrol show partition" to verify that it is?
>>>>>
>>>>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am trying to enable gang scheduling on a server with a 32-core CPU
>>>>>> and 4 GPUs.
>>>>>>
>>>>>> However, with gang scheduling enabled, CPU jobs (and GPU jobs) are not
>>>>>> being preempted after the time slice, which is set to 30 seconds.
>>>>>>
>>>>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32
>>>>>> cores. The first 2 jobs launched are never preempted; the 3rd starves
>>>>>> forever (or at least until one of the other 2 ends):
>>>>>>
>>>>>>   JOBID PARTITION     NAME    USER ST  TIME NODES NODELIST(REASON)
>>>>>>     313  asimov01 cpu-only hdaniel PD  0:00     1 (Resources)
>>>>>>     311  asimov01 cpu-only hdaniel  R  1:52     1 asimov
>>>>>>     312  asimov01 cpu-only hdaniel  R  1:49     1 asimov
>>>>>>
>>>>>> The same happens with GPU jobs: if I launch 5 jobs requiring one GPU
>>>>>> each, the 5th job never runs. Preemption is not happening at the
>>>>>> specified timeslice.
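>>>>>>
>>>>>> (A generic sanity check, not specific to this setup: confirm the
>>>>>> running scheduler actually picked these values up with
>>>>>> "scontrol show config | grep -i -e preempt -e timeslice".)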
>>>>>>
>>>>>> I tried several combinations:
>>>>>>
>>>>>> SchedulerType=sched/builtin and sched/backfill
>>>>>> SelectType=select/cons_tres and select/linear
>>>>>>
>>>>>> I'd appreciate any help and suggestions.
>>>>>> The slurm.conf is below.
>>>>>> Thanks
>>>>>>
>>>>>> ClusterName=asimov
>>>>>> SlurmctldHost=localhost
>>>>>> MpiDefault=none
>>>>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>>>>> ReturnToService=2
>>>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>>>> SlurmctldPort=6817
>>>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>>>> SlurmdPort=6818
>>>>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>>>>> SlurmUser=slurm
>>>>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>>>>> SwitchType=switch/none
>>>>>> TaskPlugin=task/none # task/cgroup
>>>>>> #
>>>>>> # TIMERS
>>>>>> InactiveLimit=0
>>>>>> KillWait=30
>>>>>> MinJobAge=300
>>>>>> SlurmctldTimeout=120
>>>>>> SlurmdTimeout=300
>>>>>> Waittime=0
>>>>>> #
>>>>>> # SCHEDULING
>>>>>> #FastSchedule=1 #obsolete
>>>>>> SchedulerType=sched/builtin #backfill
>>>>>> SelectType=select/cons_tres
>>>>>> SelectTypeParameters=CR_Core # CR_Core_Memory lets only one job run at a time
>>>>>> PreemptType=preempt/partition_prio
>>>>>> PreemptMode=SUSPEND,GANG
>>>>>> SchedulerTimeSlice=30 #in seconds, default 30
>>>>>> #
>>>>>> # LOGGING AND ACCOUNTING
>>>>>> #AccountingStoragePort=
>>>>>> AccountingStorageType=accounting_storage/none
>>>>>> #AccountingStorageEnforce=associations
>>>>>> #ClusterName=bip-cluster
>>>>>> JobAcctGatherFrequency=30
>>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>>> SlurmctldDebug=info
>>>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>>>> SlurmdDebug=info
>>>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>>>> #
>>>>>> #
>>>>>> # COMPUTE NODES
>>>>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>>>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>>>>
>>>>>> # Partitions
>>>>>> GresTypes=gpu
>>>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>>>
>>>>>>
>>>>
>>>
>>
>>
>
--
Best regards,
Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel