[slurm-users] Coexisting jobs with gres/shard and gres/gpu in the same GPU
Reppas, Andreas
andreas.reppas at charite.de
Mon Aug 28 12:56:21 UTC 2023
Hi everyone,
we recently enabled sharding to allow a GPU to be shared by multiple jobs. According to the Slurm documentation, once a GPU has been allocated as a gres/gpu resource it will not be available as a gres/shard (and vice versa).
However, on nodes with a single GPU we have seen jobs that allocate gres/shard running at the same time as jobs that allocate gres/gpu. Has anyone encountered the same behaviour?
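For context, the two kinds of jobs are submitted roughly like this (a minimal sketch; the script names and the options other than the GRES requests are placeholders, not our exact submissions):

# job requesting one 1/4 share of a GPU via sharding
sbatch --partition=gpu --gres=shard:1 job_shard.sh

# job requesting the whole GPU
sbatch --partition=gpu --gres=gpu:1 job_gpu.sh

Per the documentation we would expect these two requests never to be satisfied from the same physical GPU.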
An example of this gres/gpu and gres/shard coexistence can be seen in the following:
squeue -w s-sc-gpu017
JOBID PARTITION NAME USER STAT TIME TIME_LIMI NODES NODELIST(REASON) CPUS MIN_MEMORY
1288220 gpu spawner-ju xx1 RUNN 1-20:49:59 2-00:00:00 1 s-sc-gpu017 32 120G
1291298 gpu interactiv xx2 RUNN 13:40 8:00:00 1 s-sc-gpu017 8 32000M
scontrol show job 1288220 | grep TRES
ReqTRES=cpu=32,mem=120G,node=1,billing=62,gres/shard=1
AllocTRES=cpu=32,mem=120G,node=1,billing=62,gres/shard=1
scontrol show job 1291298 | grep TRES
ReqTRES=cpu=1,mem=32000M,node=1,billing=136,gres/gpu=1
AllocTRES=cpu=8,mem=32000M,node=1,billing=143,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1
The node status likewise shows both gres/gpu and gres/shard in the allocated TRES:
scontrol show node s-sc-gpu017 | grep TRES
CfgTRES=cpu=128,mem=500000M,billing=378,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1,gres/shard=4
AllocTRES=cpu=40,mem=154880M,gres/gpu=1,gres/gpu:nvidia_a100-pcie-40gb=1,gres/shard=1
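If it helps, the device index each job received can also be inspected with scontrol's detailed view (commands only, shown as a sketch; output omitted here):

scontrol -d show job 1288220 | grep -i gres
scontrol -d show job 1291298 | grep -i gres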
We are running Slurm 23.02.4 on Rocky Linux 8.5, and the shard-related configuration in slurm.conf is as follows:
GresTypes=gpu,shard,gpu/gfx90a,gpu/nvidia_a100-pcie-40gb,gpu/nvidia_a100-sxm4-40gb,gpu/nvidia_a100-sxm4-80gb,gpu/nvidia_a100_80gb_pcie
AccountingStorageTRES=gres/gpu,gres/shard,gres/gpu:gfx90a,gres/gpu:nvidia_a100-pcie-40gb,gres/gpu:nvidia_a100-sxm4-40gb,gres/gpu:nvidia_a100-sxm4-80gb,gres/gpu:nvidia_a100_80gb_pcie
NodeName=s-sc-gpu003 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu017 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu018 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu019 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
NodeName=s-sc-gpu021 CPUs=128 Boards=1 SocketsPerBoard=2 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=500000 Gres=gpu:nvidia_a100-pcie-40gb:1,shard:4 State=UNKNOWN Weight=1
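For reference, a gres.conf consistent with this node definition would look roughly as follows (a sketch based on the sharding section of the Slurm documentation; the device file path is an assumption and our actual gres.conf may differ):

# gres.conf on the single-GPU nodes (sketch; device path assumed)
Name=gpu Type=nvidia_a100-pcie-40gb File=/dev/nvidia0
Name=shard Count=4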
Kind Regards,
Andreas
-----------
Dr. Andreas Reppas
Geschäftsbereich IT | Scientific Computing
Charité – Universitätsmedizin Berlin
Campus Charité Virchow Klinikum
Forum 4 | Ebene 02 | Raum 2.020
Augustenburger Platz 1
13353 Berlin
andreas.reppas at charite.de
https://www.charite.de