Hi all,
We're trying to enable sharding on our compute cluster. On this cluster:
- ensicompute-1 comes with 1 NVIDIA V100 GPU;
- ensicompute-13 comes with 3 NVIDIA A40 GPUs;
- all other nodes (for now, ensicompute-11 and ensicompute-12, but several others will come) come with 3 NVIDIA RTX 6000 GPUs.
To enable sharding, I followed these steps:
1. [slurm.conf] Add "shard" to GresTypes;
2. [slurm.conf] Add "shard:N" to the Gres of each node. For testing purposes, I have set N to 9, so each GPU can execute up to 3 jobs concurrently:
    NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
3. [gres.conf] Declare the shards after the definition of the gpu GRES.
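For what it's worth, a minimal way to check what the controller registered for a node after a reconfigure (example node name; the Gres= line should list both the gpu and shard GRES) is:

    # Check what the controller registered for this node;
    # the Gres= line should show gpu:Quadro:3,shard:9
    scontrol show node ensicompute-11 | grep -i gres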
For step 3, I tried different things, leading to different outcomes:

a. Define a global number of shards for the entire host:
    Name=shard Count=9
==> This way, sharding seems to work OK, but all the jobs are executed on GPU#0. If running 12 jobs for example, 9 of them are assigned to GPU#0 and start executing, while 3 of them remain in a pending state. No job is assigned to GPU#1 or GPU#2.
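For context, the test jobs each request a single shard. A minimal reproduction along these lines (illustrative only, not my exact script, and assuming Slurm exports CUDA_VISIBLE_DEVICES for shard allocations the same way it does for gpu ones) shows which device each job ends up on:

    # Submit a dozen one-shard jobs and record which GPU each one sees
    for i in $(seq 1 12); do
        sbatch --gres=shard:1 --wrap 'echo "job $SLURM_JOB_ID -> CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; sleep 300'
    done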
b. Define a per-GPU number of shards, associated with the device file representing the GPU:
    Name=shard Count=3 File=/dev/nvidia0
    Name=shard Count=3 File=/dev/nvidia1
    Name=shard Count=3 File=/dev/nvidia2
==> In this case, the slurmd service fails to start on the compute node. The error message found in /var/log/slurmd.log is "fatal: Invalid GRES record for shard, count does not match File value".
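If it helps with diagnosing this case: running slurmd in the foreground with extra verbosity on the node shows how gres.conf gets parsed right before the fatal error (standard slurmd flags; just a debugging sketch):

    # Run slurmd in the foreground with debug output to see GRES parsing
    # (-D = don't daemonize, -vvv = increase verbosity; stop the service first)
    systemctl stop slurmd
    slurmd -D -vvv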
c. Don't define anything about shards in gres.conf.
==> Same behavior as in a.: all jobs are executed on GPU#0.
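The GPU#0-only behavior is also visible from the detailed node query, which (if I read the output right) reports per-index GRES usage; sketch below with a sample node name:

    # Detailed node view: GresUsed should show which device indices
    # the shard allocations actually landed on
    scontrol -d show node ensicompute-11 | grep -i gres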
The full content of the slurm.conf and gres.conf files is included below. What is the proper way to configure sharding in a cluster with several GPUs per node? Is there a way to specify how many shards should be allocated to each GPU?
Cheers, François
=== slurm.conf ===
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ClusterName=ensimag
SlurmctldHost=nash
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
TaskPlugin=task/affinity,task/cgroup
ReturnToService=2
#
#
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
#
#
# SCHEDULING
SchedulerType=sched/backfill
SelectType=select/cons_tres
#
#
# LOGGING AND ACCOUNTING
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#
#
# COMPUTE NODES
GresTypes=gpu,shard
NodeName=ensicompute-1 Gres=gpu:Tesla:1,shard:3 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
NodeName=ensicompute-13 Gres=gpu:A40:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
NodeName=ensicompute-[11-12] Gres=gpu:Quadro:3,shard:9 CPUs=40 RealMemory=128520 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN Feature=gpu,ht
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
=== gres.conf ===
AutoDetect=off
# ensicompute-1
NodeName=ensicompute-1 Name=gpu Type=Tesla File=/dev/nvidia0
NodeName=ensicompute-1 Name=shard Count=3 File=/dev/nvidia0
# ensicompute-11
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia0
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia1
NodeName=ensicompute-11 Name=gpu Type=Quadro File=/dev/nvidia2
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-11 Name=shard Count=3 File=/dev/nvidia2
# ensicompute-12
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia0
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia1
NodeName=ensicompute-12 Name=gpu Type=Quadro File=/dev/nvidia2
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-12 Name=shard Count=3 File=/dev/nvidia2
# ensicompute-13
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia0
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia1
NodeName=ensicompute-13 Name=gpu Type=A40 File=/dev/nvidia2
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia0
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia1
NodeName=ensicompute-13 Name=shard Count=3 File=/dev/nvidia2
--
François Broquedis, Ingénieur Service Informatique
Grenoble INP - Ensimag, bureau E208
681 rue de la Passerelle
BP 72, 38402 Saint Martin d'Hères CEDEX
Tél.: +33 (0)4 76 82 72 78