[slurm-users] Cannot enable Gang scheduling

Kevin Broch kbroch at rivosinc.com
Fri Jan 13 13:01:40 UTC 2023


Sorry to hear that. Hopefully others in the group have some
ideas/explanations.  I haven't had to deal with GPU resources in Slurm.

On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel <hdaniel at ualg.pt> wrote:

> Oh, ok.
> I guess I was expecting that the GPU job would be suspended by copying
> GPU memory out to RAM.
>
> I also tried REQUEUE,GANG and CANCEL,GANG.
>
> None of these options seems to be able to preempt GPU jobs.
>
> On Fri, 13 Jan 2023 at 12:30, Kevin Broch <kbroch at rivosinc.com> wrote:
>
>> My guess is that this isn't possible with GANG,SUSPEND.  GPU memory
>> isn't managed by Slurm, so suspending one job's GPU memory so that
>> another job can use it simply isn't possible.
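>>
>> If that's the case, one way to see it (assuming NVIDIA GPUs; this is
>> speculation on my part): a gang-suspended job is essentially just
>> SIGSTOPped, so its processes should still show up holding device memory:
>>
>> nvidia-smi --query-compute-apps=pid,used_memory --format=csv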
>>
>> On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdaniel at ualg.pt> wrote:
>>
>>> Hi Kevin
>>>
>>> I did a "scontrol show partition".
>>> OverSubscribe was not enabled.
>>> I enabled it in slurm.conf with:
>>>
>>> (...)
>>> GresTypes=gpu
>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>
>>> but now it only works for CPU jobs; it does not preempt GPU jobs.
>>> Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, they are
>>> preempted after the timeslice as expected:
>>>
>>> sbatch --cpus-per-task=32 test-cpu.sh
>>>
>>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>     352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>>>     353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>>>     351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
>>>
>>> But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2
>>> that start running are never preempted; the 3rd job just sits pending
>>> with reason (Resources).
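>>>
>>> Each was launched with something like the following (exact script name
>>> aside):
>>>
>>> sbatch --gres=gpu:2 test-gpu.sh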
>>>
>>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>     356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
>>>     354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>>>     355  asimov01      gpu  hdaniel  R       3:02      1 asimov
>>>
>>> Do I need to change anything else in the configuration to also support
>>> GPU gang scheduling?
>>> Thanks
>>>
>>>
>>> ============================================================================
>>> scontrol show partition asimov01
>>> PartitionName=asimov01
>>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>>    AllocNodes=ALL Default=YES QoS=N/A
>>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>>    Nodes=asimov
>>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>>>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>>>    JobDefaults=DefCpuPerGPU=2
>>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>>
>>> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>>>
>>>> Could the problem be that OverSubscribe is not enabled? Without it, I
>>>> don't believe the time-slicing can be gang scheduled.
>>>>
>>>> Can you do a "scontrol show partition" to verify that it is?
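>>>>
>>>> If it's not, adding it to the partition line should be enough. Per the
>>>> slurm.conf docs, FORCE can also take a per-resource job limit, e.g.
>>>> FORCE:2 to time-slice at most two jobs on each resource:
>>>>
>>>> PartitionName=asimov01 Nodes=asimov OverSubscribe=FORCE ... (rest unchanged)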
>>>>
>>>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am trying to enable gang scheduling on a server with a 32-core CPU
>>>>> and 4 GPUs.
>>>>>
>>>>> However, with gang scheduling, the CPU jobs (or GPU jobs) are not being
>>>>> preempted after the time slice, which is set to 30 seconds.
>>>>>
>>>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>>>> The first 2 jobs launched are never preempted. The 3rd job starves
>>>>> forever (or at least until one of the other 2 ends):
>>>>>
>>>>>   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>>>     313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>>>>     311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>>>>     312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>>>>
>>>>> The same happens with GPU jobs. If I launch 5 jobs requiring one GPU
>>>>> each, the 5th job never runs. Preemption is not happening at the
>>>>> specified timeslice.
>>>>>
>>>>> I tried several combinations:
>>>>>
>>>>> SchedulerType=sched/builtin  and backfill
>>>>> SelectType=select/cons_tres   and linear
>>>>>
>>>>> I'd appreciate any help and suggestions.
>>>>> The slurm.conf is below.
>>>>> Thanks
>>>>>
>>>>> ClusterName=asimov
>>>>> SlurmctldHost=localhost
>>>>> MpiDefault=none
>>>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>>>> ReturnToService=2
>>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>>> SlurmctldPort=6817
>>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>>> SlurmdPort=6818
>>>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>>>> SlurmUser=slurm
>>>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>>>> SwitchType=switch/none
>>>>> TaskPlugin=task/none # task/cgroup
>>>>> #
>>>>> # TIMERS
>>>>> InactiveLimit=0
>>>>> KillWait=30
>>>>> MinJobAge=300
>>>>> SlurmctldTimeout=120
>>>>> SlurmdTimeout=300
>>>>> Waittime=0
>>>>> #
>>>>> # SCHEDULING
>>>>> #FastSchedule=1 #obsolete
>>>>> SchedulerType=sched/builtin #backfill
>>>>> SelectType=select/cons_tres
>>>>> SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
>>>>> PreemptType=preempt/partition_prio
>>>>> PreemptMode=SUSPEND,GANG
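>>>>> # note: per the Slurm gang scheduling guide, any partition to be gang
>>>>> # scheduled must also set OverSubscribe=FORCE (not set on the partition
>>>>> # below)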
>>>>> SchedulerTimeSlice=30           #in seconds, default 30
>>>>> #
>>>>> # LOGGING AND ACCOUNTING
>>>>> #AccountingStoragePort=
>>>>> AccountingStorageType=accounting_storage/none
>>>>> #AccountingStorageEnforce=associations
>>>>> #ClusterName=bip-cluster
>>>>> JobAcctGatherFrequency=30
>>>>> JobAcctGatherType=jobacct_gather/linux
>>>>> SlurmctldDebug=info
>>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>>> SlurmdDebug=info
>>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>>> #
>>>>> #
>>>>> # COMPUTE NODES
>>>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>>>
>>>>> # Partitions
>>>>> GresTypes=gpu
>>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>>
>>>>>
>>>
>>> --
>>> with best regards,
>>>
>>> Helder Daniel
>>> Universidade do Algarve
>>> Faculdade de Ciências e Tecnologia
>>> Departamento de Engenharia Electrónica e Informática
>>> https://www.ualg.pt/pt/users/hdaniel
>>>
>>
>
> --
> with best regards,
>
> Helder Daniel
> Universidade do Algarve
> Faculdade de Ciências e Tecnologia
> Departamento de Engenharia Electrónica e Informática
> https://www.ualg.pt/pt/users/hdaniel
>