[slurm-users] Cannot enable Gang scheduling

Helder Daniel hdaniel at ualg.pt
Fri Jan 13 02:21:38 UTC 2023


Hi,

I am trying to enable gang scheduling on a server with one 32-core CPU
and 4 GPUs.

However, with gang scheduling enabled, neither CPU jobs nor GPU jobs are
preempted after the time slice, which is set to 30 seconds.

Below is a snapshot of squeue. There are 3 jobs, each needing 32 cores. The
first 2 jobs launched are never preempted, so the 3rd job starves
indefinitely (or at least until one of the other 2 ends):

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               313  asimov01 cpu-only  hdaniel PD       0:00      1 (Resources)
               311  asimov01 cpu-only  hdaniel  R       1:52      1 asimov
               312  asimov01 cpu-only  hdaniel  R       1:49      1 asimov

The same happens with GPU jobs. If I launch 5 jobs requiring one GPU each,
the 5th job never runs. Preemption is not happening at the specified
time slice.
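
The expected behaviour can be sketched with a toy round-robin model. This is
only an illustration of the rotation expected every time slice, not how
Slurm's gang scheduler is implemented. The function name round_robin_slices
is hypothetical, and the slot counts (2 concurrent CPU jobs, 4 GPUs) are taken
from the observations above.

```python
from collections import deque


def round_robin_slices(jobs, slots, n_slices):
    """Toy model: return, for each time slice, the jobs holding a slot.

    jobs     -- job IDs in submission order
    slots    -- how many jobs fit on the node at once
    n_slices -- number of time slices to simulate
    """
    queue = deque(jobs)
    schedule = []
    for _ in range(n_slices):
        # The jobs at the front of the queue run during this slice.
        schedule.append(list(queue)[:slots])
        # The jobs that just ran move to the back, so waiting jobs advance.
        queue.rotate(-min(slots, len(queue)))
    return schedule


# CPU case: 3 jobs, only 2 fit at once.
cpu_schedule = round_robin_slices([311, 312, 313], slots=2, n_slices=3)

# GPU case: 5 single-GPU jobs (hypothetical IDs 1..5), 4 GPUs.
gpu_schedule = round_robin_slices([1, 2, 3, 4, 5], slots=4, n_slices=2)

for i, running in enumerate(cpu_schedule):
    print(f"slice {i}: running {running}")
```

In this model job 313 gets a slot in the second slice, that is, after 30
seconds with SchedulerTimeSlice=30, instead of waiting for 311 or 312 to end.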

I tried several combinations of these settings:

SchedulerType=sched/builtin and sched/backfill
SelectType=select/cons_tres and select/linear

I would appreciate any help or suggestions.
The slurm.conf is below.
Thanks

ClusterName=asimov
SlurmctldHost=localhost
MpiDefault=none
ProctrackType=proctrack/linuxproc # proctrack/cgroup
ReturnToService=2
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
SwitchType=switch/none
TaskPlugin=task/none # task/cgroup
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
# SCHEDULING
#FastSchedule=1 #obsolete
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30           #in seconds, default 30
#
# LOGGING AND ACCOUNTING
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageEnforce=associations
#ClusterName=bip-cluster
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
#NodeName=asimov CPUs=64 RealMemory=500 State=UNKNOWN
#PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP

# Partitions
GresTypes=gpu
NodeName=asimov Gres=gpu:4 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=asimov01 Nodes=asimov Default=YES MaxTime=INFINITE MaxNodes=1 DefCpuPerGPU=2 State=UP
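
Because several lines above carry a commented-out alternative after the active
value, a small sketch like the following can confirm which values the file
actually sets. The helper parse_slurm_conf is hypothetical and deliberately
minimal: it keeps only the first Key=Value pair per line and lower-cases the
keys, so it is not suitable for lines that repeat a key, such as multiple
NodeName entries. It reads the file text only; the values the running daemon
uses are reported by `scontrol show config`.

```python
def parse_slurm_conf(text):
    """Parse 'Key=Value' lines, dropping '#' comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        # Everything after the first '#' is treated as a comment.
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip().lower()] = value.strip()
    return settings


# Excerpt of the SCHEDULING section from the configuration above.
SCHEDULING_EXCERPT = """\
SchedulerType=sched/builtin #backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core    #CR_Core_Memory lets only one job run at a time
PreemptType = preempt/partition_prio
PreemptMode = SUSPEND,GANG
SchedulerTimeSlice=30           #in seconds, default 30
"""

conf = parse_slurm_conf(SCHEDULING_EXCERPT)
for key in ("schedulertype", "selecttype", "preemptmode", "schedulertimeslice"):
    print(f"{key} = {conf.get(key)}")
```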