[slurm-users] Cannot enable Gang scheduling

Helder Daniel hdaniel at ualg.pt
Fri Jan 13 12:29:26 UTC 2023


PS: I checked the resources while running the 3 GPU jobs, which were launched
with:

sbatch --gpus-per-task=2 --cpus-per-task=1 cnn-multi.sh
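
For context, a minimal cnn-multi.sh would be something like the sketch below
(the Python script name is just a placeholder):

#!/bin/bash
#SBATCH --job-name=gpu
# GPUs and CPUs are requested on the sbatch command line above
srun python cnn-multi.py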

The server has 64 logical cores (32 physical cores x 2 threads with hyperthreading):

cat /proc/cpuinfo | grep processor | tail -n1
processor : 63

128 GB main memory:

hdaniel@asimov:~/Works/Turbines/02-CNN$ cat /proc/meminfo
MemTotal:       131725276 kB
MemFree:        106773356 kB
MemAvailable:   109398780 kB
Buffers:          161012 kB
(...)

And 4 GPUs, each with 16 GB of memory:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A4000    On   | 00000000:41:00.0 Off |                  Off |
| 45%   63C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A4000    On   | 00000000:42:00.0 Off |                  Off |
| 44%   63C    P2    45W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A4000    On   | 00000000:61:00.0 Off |                  Off |
| 50%   68C    P2    52W / 140W |  15370MiB / 16376MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A4000    On   | 00000000:62:00.0 Off |                  Off |
| 46%   64C    P2    47W / 140W |  15370MiB / 16376MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      2472      G   /usr/bin/gnome-shell                4MiB |
|    0   N/A  N/A    524228      C   /bin/python                     15352MiB |
|    1   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A    524228      C   /bin/python                     15362MiB |
|    2   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A    524226      C   /bin/python                     15362MiB |
|    3   N/A  N/A      2146      G   /usr/lib/xorg/Xorg                  4MiB |
|    3   N/A  N/A    524226      C   /bin/python                     15362MiB |
+-----------------------------------------------------------------------------+

On Fri, 13 Jan 2023 at 12:08, Helder Daniel <hdaniel at ualg.pt> wrote:

> Hi Kevin
>
> I did a "scontrol show partition".
> OverSubscribe was not enabled.
> I enabled it in slurm.conf with:
>
> (...)
> GresTypes=gpu
> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2
> State=UNKNOWN
> PartitionName=asimov01 *OverSubscribe=FORCE* Nodes=asimov Default=YES
> MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
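>
> (Note: changes like this in slurm.conf only take effect after something like
> "scontrol reconfigure" or a restart of slurmctld.)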
>
> But now gang scheduling works only for CPU jobs; it does not preempt GPU jobs.
> Launching 3 CPU-only jobs, each requiring 32 of the 64 cores, they are
> preempted after the time slice as expected:
>
> sbatch --cpus-per-task=32 test-cpu.sh
>
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>                352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov
>                353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov
>                351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov
>
> But launching 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2 jobs
> that start running are never preempted.
> The 3rd job stays pending on Resources:
>
>              JOBID PARTITION     NAME     USER ST       TIME  NODES
> NODELIST(REASON)
>                356  asimov01      gpu  hdaniel PD       0:00      1
> (Resources)
>                354  asimov01      gpu  hdaniel  R       3:05      1 asimov
>                355  asimov01      gpu  hdaniel  R       3:02      1 asimov
>
> Do I need to change anything else in the configuration to also support GPU
> gang scheduling?
> Thanks
>
>
> ============================================================================
> scontrol show partition asimov01
> PartitionName=asimov01
>    AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>    AllocNodes=ALL Default=YES QoS=N/A
>    DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0
> Hidden=NO
>    MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
>    Nodes=asimov
>    PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
> OverSubscribe=NO
>    OverTimeLimit=NONE PreemptMode=GANG,SUSPEND
>    State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE
>    JobDefaults=DefCpuPerGPU=2
>    DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
>
> On Fri, 13 Jan 2023 at 11:16, Kevin Broch <kbroch at rivosinc.com> wrote:
>
>> The problem might be that OverSubscribe is not enabled. Without it, I don't
>> believe jobs can be time-sliced by the gang scheduler.
>>
>> Can you do a "scontrol show partition" to verify that it is?
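>>
>> You could also dump the scheduler settings the controller actually loaded,
>> with something like:
>>
>> scontrol show config | grep -iE 'Preempt|SchedulerTimeSlice|SelectType'
>>
>> to confirm PreemptMode and the time slice are what you expect.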
>>
>> On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <hdaniel at ualg.pt> wrote:
>>
>>> Hi,
>>>
>>> I am trying to enable gang scheduling on a server with a 32-core CPU and
>>> 4 GPUs.
>>>
>>> However, with gang scheduling enabled, the CPU jobs (or GPU jobs) are not
>>> being preempted after the time slice, which is set to 30 seconds.
>>>
>>> Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores.
>>> The first 2 jobs launched are never preempted. The 3rd job starves forever
>>> (or at least until one of the other 2 ends):
>>>
>>>              JOBID PARTITION     NAME     USER ST       TIME  NODES
>>> NODELIST(REASON)
>>>                313  asimov01 cpu-only  hdaniel PD       0:00      1
>>> (Resources)
>>>                311  asimov01 cpu-only  hdaniel  R       1:52      1
>>> asimov
>>>                312  asimov01 cpu-only  hdaniel  R       1:49      1
>>> asimov
>>>
>>> The same happens with GPU jobs. If I launch 5 jobs, each requiring one
>>> GPU, the 5th job never runs. Preemption is not happening at the specified
>>> time slice.
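>>>
>>> (Each of those jobs was submitted requesting a single GPU, along the lines
>>> of "sbatch --gpus-per-task=1 --cpus-per-task=1 cnn.sh"; the script name
>>> here is just a placeholder.)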
>>>
>>> I tried several combinations:
>>>
>>> SchedulerType=sched/builtin  and sched/backfill
>>> SelectType=select/cons_tres  and select/linear
>>>
>>> I'd appreciate any help or suggestions.
>>> The slurm.conf is below.
>>> Thanks
>>>
>>> ClusterName=asimov
>>> SlurmctldHost=localhost
>>> MpiDefault=none
>>> ProctrackType=proctrack/linuxproc # proctrack/cgroup
>>> ReturnToService=2
>>> SlurmctldPidFile=/var/run/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/lib/slurm/slurmd
>>> SlurmUser=slurm
>>> StateSaveLocation=/var/lib/slurm/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/none # task/cgroup
>>> #
>>> # TIMERS
>>> InactiveLimit=0
>>> KillWait=30
>>> MinJobAge=300
>>> SlurmctldTimeout=120
>>> SlurmdTimeout=300
>>> Waittime=0
>>> #
>>> # SCHEDULING
>>> #FastSchedule=1 #obsolete
>>> SchedulerType=sched/builtin #backfill
>>> SelectType=select/cons_tres
>>> SelectTypeParameters=CR_Core    # CR_Core_Memory lets only one job run at a time
>>> PreemptType = preempt/partition_prio
>>> PreemptMode = SUSPEND,GANG
>>> SchedulerTimeSlice=30           #in seconds, default 30
>>> #
>>> # LOGGING AND ACCOUNTING
>>> #AccountingStoragePort=
>>> AccountingStorageType=accounting_storage/none
>>> #AccountingStorageEnforce=associations
>>> #ClusterName=bip-cluster
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/linux
>>> SlurmctldDebug=info
>>> SlurmctldLogFile=/var/log/slurm/slurmctld.log
>>> SlurmdDebug=info
>>> SlurmdLogFile=/var/log/slurm/slurmd.log
>>> #
>>> #
>>> # COMPUTE NODES
>>> #NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
>>> #PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>>>
>>> # Partitions
>>> GresTypes=gpu
>>> NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2
>>> State=UNKNOWN
>>> PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE
>>> MaxNodes=1 DefCpuPerGPU=2 State=UP
>>>
>>>
>
> --
> com os melhores cumprimentos,
>
> Helder Daniel
> Universidade do Algarve
> Faculdade de Ciências e Tecnologia
> Departamento de Engenharia Electrónica e Informática
> https://www.ualg.pt/pt/users/hdaniel
>


-- 
com os melhores cumprimentos,

Helder Daniel
Universidade do Algarve
Faculdade de Ciências e Tecnologia
Departamento de Engenharia Electrónica e Informática
https://www.ualg.pt/pt/users/hdaniel