<div dir="ltr">Sorry to hear that. Hopefully others in the group have some ideas/explanations.  I haven't had to deal with GPU resources in Slurm.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 13, 2023 at 4:51 AM Helder Daniel <<a href="mailto:hdaniel@ualg.pt">hdaniel@ualg.pt</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Oh, ok.</div><div>I guess I was expecting the GPU job to be suspended by copying its GPU memory to host RAM.<br></div><div><br></div><div>I also tried REQUEUE,GANG and CANCEL,GANG.</div><div></div><div><br></div><div>None of these options seems able to preempt GPU jobs.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 13 Jan 2023 at 12:30, Kevin Broch <<a href="mailto:kbroch@rivosinc.com" target="_blank">kbroch@rivosinc.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">My guess is that this isn't possible with GANG,SUSPEND.  
GPU memory isn't managed in Slurm, so the idea of suspending one job's GPU memory so that another job can use the rest simply isn't possible.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jan 13, 2023 at 4:08 AM Helder Daniel <<a href="mailto:hdaniel@ualg.pt" target="_blank">hdaniel@ualg.pt</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi Kevin,</div><div><br></div><div>I did a "scontrol show partition".</div><div>OverSubscribe was not enabled.</div><div>I enabled it in slurm.conf with:</div><div><br></div><div>(...)<br></div><div>GresTypes=gpu<br>NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN<br></div><div>PartitionName=asimov01 <b>OverSubscribe=FORCE</b> Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP</div><div><br></div><div>But now it works only with CPU jobs. It does not preempt GPU jobs.</div><div>When I launch 3 CPU-only jobs, each requiring 32 of the 64 cores, they are preempted after each time slice, as expected:<br></div><div><br></div><div>sbatch --cpus-per-task=32 test-cpu.sh</div><div><br></div><div>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)<br>               352  asimov01 cpu-only  hdaniel  R       0:58      1 asimov<br>               353  asimov01 cpu-only  hdaniel  R       0:25      1 asimov<br>               351  asimov01 cpu-only  hdaniel  S       0:36      1 asimov<br></div><div><br></div><div>But when I launch 3 GPU jobs, each requiring 2 of the 4 GPUs, the first 2 that start running are not preempted.</div><div>The 3rd job stays pending, waiting on resources:<br></div><div><br></div><div>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)<br>               356  asimov01      gpu  hdaniel PD       0:00      1 (Resources)<br>               354  asimov01      gpu  hdaniel  R       3:05      1 
asimov<br>               355  asimov01      gpu  hdaniel  R       3:02      1 asimov<br></div><div><br></div><div>Do I need to change anything else in the configuration to support also gpu gang scheduling?<br></div><div>Thanks <br></div><div><br></div><div>============================================================================<br></div><div>scontrol show partition asimov01<br>PartitionName=asimov01<br>   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL<br>   AllocNodes=ALL Default=YES QoS=N/A<br>   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO<br>   MaxNodes=1 MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED<br>   Nodes=asimov<br>   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO<br>   OverTimeLimit=NONE PreemptMode=GANG,SUSPEND<br>   State=UP TotalCPUs=64 TotalNodes=1 SelectTypeParameters=NONE<br>   JobDefaults=DefCpuPerGPU=2<br>   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 13 Jan 2023 at 11:16, Kevin Broch <<a href="mailto:kbroch@rivosinc.com" target="_blank">kbroch@rivosinc.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">Problem might be that OverSubscribe is not enabled?  
Without it, I don't believe jobs can be time-sliced by gang scheduling.<div><br></div><div>Can you do a "scontrol show partition" to verify that it is?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jan 12, 2023 at 6:24 PM Helder Daniel <<a href="mailto:hdaniel@ualg.pt" target="_blank">hdaniel@ualg.pt</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Hi,</div><div><br></div><div>I am trying to enable gang scheduling on a server with one 32-core CPU and 4 GPUs.</div><div><br></div><div>However, with gang scheduling enabled, neither CPU jobs nor GPU jobs are preempted after the time slice, which is set to 30 seconds.</div><div><br></div><div>Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The first 2 jobs launched are never preempted. The 3rd job starves indefinitely (or at least until one of the other 2 ends):</div><div><br></div><div>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)<br>               313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)<br>               311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov<br>               312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov<br></div><div><br></div><div>The same happens with GPU jobs. If I launch 5 jobs, each requiring one GPU, the 5th job never runs. 
Preemption does not happen at the specified time slice.</div><div><br></div><div>I tried several combinations:</div><div><br></div><div>SchedulerType=sched/builtin and sched/backfill<br>SelectType=select/cons_tres and select/linear</div><div><br></div><div>I would appreciate any help or suggestions.</div><div></div><div></div><div>The slurm.conf is below.</div><div>Thanks<br></div><div><br></div><div>ClusterName=asimov<br>SlurmctldHost=localhost<br>MpiDefault=none<br>ProctrackType=proctrack/linuxproc       # proctrack/cgroup<br>ReturnToService=2<br>SlurmctldPidFile=/var/run/slurmctld.pid<br>SlurmctldPort=6817<br>SlurmdPidFile=/var/run/slurmd.pid<br>SlurmdPort=6818<br>SlurmdSpoolDir=/var/lib/slurm/slurmd<br>SlurmUser=slurm<br>StateSaveLocation=/var/lib/slurm/slurmctld<br>SwitchType=switch/none<br>TaskPlugin=task/none         # task/cgroup<br>#<br># TIMERS<br>InactiveLimit=0<br>KillWait=30<br>MinJobAge=300<br>SlurmctldTimeout=120<br>SlurmdTimeout=300<br>Waittime=0<br>#<br># SCHEDULING<br>#FastSchedule=1                  #obsolete<br>SchedulerType=sched/builtin  #backfill<br>SelectType=select/cons_tres<br>SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time<br>PreemptType = preempt/partition_prio<br>PreemptMode = SUSPEND,GANG<br>SchedulerTimeSlice=30           #in seconds, default 30<br>#<br># LOGGING AND ACCOUNTING<br>#AccountingStoragePort=<br>AccountingStorageType=accounting_storage/none<br>#AccountingStorageEnforce=associations<br>#ClusterName=bip-cluster<br>JobAcctGatherFrequency=30<br>JobAcctGatherType=jobacct_gather/linux<br>SlurmctldDebug=info<br>SlurmctldLogFile=/var/log/slurm/slurmctld.log<br>SlurmdDebug=info<br>SlurmdLogFile=/var/log/slurm/slurmd.log<br>#<br>#<br># COMPUTE NODES<br>#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN<br>#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP<br><br># Partitions<br>GresTypes=gpu<br>NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 
State=UNKNOWN<br>PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP<br></div><div><br></div></div>

</blockquote></div>

</blockquote></div><br clear="all"><br>-- <br><div dir="ltr"><div dir="ltr"><div>With best regards,<br></div><div><br></div>Helder Daniel<br>Universidade do Algarve<br>Faculdade de Ciências e Tecnologia<br>Departamento de Engenharia Electrónica e Informática<br><a href="https://www.ualg.pt/pt/users/hdaniel" target="_blank">https://www.ualg.pt/pt/users/hdaniel</a></div></div>

</blockquote></div>

</blockquote></div>

</blockquote></div>
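<br><div dir="ltr"><div>For reference, here is a minimal sketch of the settings from the quoted messages under which CPU jobs were time-sliced. It only collects lines already posted in this thread, and it assumes that the part of slurm.conf elided as "(...)" in the follow-up message was unchanged from the original file. It is not a tested configuration, and according to the thread it did not make GPU jobs time-slice.</div><pre># Scheduling settings, as posted in the original slurm.conf
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
PreemptType=preempt/partition_prio
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30    # seconds

# Node and partition lines, as posted in the follow-up message
# (assumption: everything else in the file stayed the same)
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 OverSubscribe=FORCE Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP</pre><div>The only difference from the original file is OverSubscribe=FORCE on the partition line. Running "scontrol show partition asimov01" after restarting the daemons should then no longer report OverSubscribe=NO.</div></div>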