[slurm-users] Cannot enable Gang scheduling

Helder Daniel hdaniel at ualg.pt
Fri Jan 13 12:50:55 UTC 2023


Oh, ok.
I guess I was expecting that the GPU job would be suspended by copying its GPU
memory to RAM.

I also tried REQUEUE,GANG and CANCEL,GANG.

None of these options seems to be able to preempt GPU jobs.
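
For reference, the preemption-related lines in slurm.conf looked roughly like
this for each attempt (a sketch; only the PreemptMode line was changed between
runs):

PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG       # original attempt
#PreemptMode=REQUEUE,GANG      # also tried
#PreemptMode=CANCEL,GANG       # also tried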

On Fri, 13 Jan 2023 at 12:30, Kevin Broch <kbroch at rivosinc.com> wrote:

> My guess is that this isn't possible with GANG,SUSPEND.  GPU memory isn't
> managed by Slurm, so the idea of suspending a job's GPU memory so that
> another job can use it simply isn't possible.
>
> On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <hdaniel at ualg.pt> wrote:
>
>> Hi Kevin
>>
>> I did a "scontrol show partition".
>> OverSubscribe was not enabled.
>> I enabled it in slurm.conf with:
>>
>> (...)
>> GresTypes=gpu
>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>
>> But now it works only for CPU jobs; it does not preempt GPU jobs.
>> Launching 3 CPU-only jobs, each requiring 32 out of the 64 cores, they are
>> preempted after the time slice as expected:
>>
>> sbatch --cpus-per-task=32 test-cpu.sh
>>
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>>                353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>>                351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
>>
>> But when I launch 3 GPU jobs, each requiring 2 out of the 4 GPUs, the first
>> 2 that start running are never preempted.
>> The 3rd job just stays pending on resources.
>>
>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>                356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)
>>                354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>>                355  asimov01      gpu  hdaniel  R       3:02      1 asimov
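>>
>> For reference, the GPU jobs were submitted along these lines (a sketch: the
>> script name test-gpu.sh is a placeholder and the exact GRES request is only
>> inferred from "2 out of the 4 GPUs"):
>>
>> sbatch --gres=gpu:2 test-gpu.sh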
>>
>> Do I need to change anything else in the configuration to also support GPU
>> gang scheduling?
>> Thanks
>>
>>
>> ============================================================================
>> scontrol show partition asimov01
>> PartitionName=asimov01
>>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>>    AllocNodes=ALL Default=YES QoS=N/A
>>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
>>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>>    Nodes=asimov
>>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
>>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>>    JobDefaults=DefCpuPerGPU=2
>>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>>
>> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>>
>>> The problem might be that OverSubscribe is not enabled? Without it, I don't
>>> believe the time-slicing can be gang scheduled.
>>>
>>> Can you do a "scontrol show partition" to verify that it is?
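>>>
>>> If it isn't, enabling it would look something like this on the partition
>>> definition (just a sketch of where the OverSubscribe option goes, based on
>>> your existing partition line):
>>>
>>> PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP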
>>>
>>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to enable gang scheduling on a server with a 32-core CPU and 4
>>>> GPUs.
>>>>
>>>> However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not
>>>> being preempted after the time slice, which is set to 30 seconds.
>>>>
>>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>>> The first 2 jobs launched are never preempted, and the 3rd job starves
>>>> forever (or at least until one of the other 2 ends):
>>>>
>>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>>>>                313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
>>>>                311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
>>>>                312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov
>>>>
>>>> The same happens with GPU jobs. If I launch 5 jobs, each requiring one GPU,
>>>> the 5th job will never run. Preemption is not happening at the specified
>>>> time slice.
>>>>
>>>> I tried several combinations:
>>>>
>>>> SchedulerType=sched/builtin  and backfill
>>>> SelectType=select/cons_tres   and linear
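>>>>
>>>> i.e., in slurm.conf (a sketch; only one of each pair was active at a time):
>>>>
>>>> SchedulerType=sched/builtin     # also tried sched/backfill
>>>> SelectType=select/cons_tres     # also tried select/linear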
>>>>
>>>> I'd appreciate any help and suggestions.
>>>> The slurm.conf is below.
>>>> Thanks
>>>>
>>>> ClusterName=asimov
>>>> SlurmctldHost=localhost
>>>> MpiDefault=none
>>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>>> ReturnToService=2
>>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>>> SlurmctldPort=6817
>>>> SlurmdPidFile=/var/run/slurmd.pid
>>>> SlurmdPort=6818
>>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>>> SlurmUser=slurm
>>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>>> SwitchType=switch/none
>>>> TaskPlugin=task/none # task/cgroup
>>>> #
>>>> # TIMERS
>>>> InactiveLimit=0
>>>> KillWait=30
>>>> MinJobAge=300
>>>> SlurmctldTimeout=120
>>>> SlurmdTimeout=300
>>>> Waittime=0
>>>> #
>>>> # SCHEDULING
>>>> #FastSchedule=1 #obsolete
>>>> SchedulerType=sched/builtin #backfill
>>>> SelectType=select/cons_tres
>>>> SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
>>>> PreemptType = preempt/partition_prio
>>>> PreemptMode = SUSPEND,GANG
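>>>> # GANG time-slices jobs that share oversubscribed resources; SUSPEND is the
>>>> # action applied to the job whose slice has ended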
>>>> SchedulerTimeSlice=30           #in seconds, default 30
>>>> #
>>>> # LOGGING AND ACCOUNTING
>>>> #AccountingStoragePort=
>>>> AccountingStorageType=accounting_storage/none
>>>> #AccountingStorageEnforce=associations
>>>> #ClusterName=bip-cluster
>>>> JobAcctGatherFrequency=30
>>>> JobAcctGatherType=jobacct_gather/linux
>>>> SlurmctldDebug=info
>>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>>> SlurmdDebug=info
>>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>>> #
>>>> #
>>>> # COMPUTE NODES
>>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>>
>>>> # Partitions
>>>> GresTypes=gpu
>>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>>
>>>>
>>
>

-- 
With best regards,

Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel