[slurm-users] Custom GRES not working in 21.08.2

Quirin Lohr quirin.lohr at in.tum.de
Sun Oct 17 14:13:06 UTC 2021


Hi,

I just upgraded from 20.11 to 21.08.2.

Since the upgrade, slurmd no longer seems to handle my custom GRES.
I define the VRAM of the GPUs as a custom GRES so that users can select a 
GPU with enough VRAM for their jobs.

I defined the VRAM in gres.conf:

> NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly
> NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly
> NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly
> NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly



and in slurm.conf:
> AccountingStorageTRES=gres/gpu,gres/gpu:p6000,gres/gpu:titan,gres/VRAM,gres/gpu:rtx_5000,gres/gpu:rtx_6000,gres/gpu:rtx_8000,gres/gpu:rtx_a6000
> GresTypes=gpu,VRAM
> NodeName=node1  CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000  Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:4,VRAM:no_consume:24G
> NodeName=node2  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:7,VRAM:no_consume:12G
> NodeName=node3  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=21 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
> NodeName=node4  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=22 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
> NodeName=node5  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=23 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
> NodeName=node6  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=24 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G
> NodeName=node7  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=31 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:8,VRAM:no_consume:24G
> NodeName=node8  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=40 Feature=CPU_GEN:SKYL,CPU_SKU=GOLD-61,rtx_5000 Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
> NodeName=node9  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=50 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_6000   Gres=gpu:rtx_6000:9,VRAM:no_consume:24G
> NodeName=node10 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=41 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_5000   Gres=gpu:rtx_5000:9,VRAM:no_consume:16G
> NodeName=node11 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=60 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
> NodeName=node12 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=61 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
> NodeName=node13 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=62 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G
> NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000  Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G


If I run a job specifying only --gpus=1, it is executed on node2. If I 
add --gres=VRAM:32G, it is scheduled to node12, but then terminated 
with "Invalid generic resource (gres) specification".
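
For illustration only, here is a minimal Python sketch of the node selection the VRAM GRES is meant to produce. It is not Slurm code and does not reproduce the scheduler's logic. The helper names (parse_count, expand_nodes, nodes_with_vram) are hypothetical, the node-range parser handles only the simple bracket forms used in this message, and the K/M/G/T suffixes are assumed to be multiples of 1024.

```python
import re

# Binary multipliers; assumed to match how Slurm reads GRES count suffixes.
_SUFFIX = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Turn a count such as '24G' into an integer."""
    match = re.fullmatch(r"(\d+)([KMGT]?)", text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognised count: {text!r}")
    return int(match.group(1)) * _SUFFIX[match.group(2)]


def expand_nodes(expr):
    """Expand a simple expression such as 'node[1,7,9]' or 'node[2-6]'."""
    match = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", expr)
    if match is None:
        return [expr]  # plain host name, nothing to expand
    prefix, body = match.groups()
    names = []
    for part in body.split(","):
        low, _, high = part.partition("-")
        for index in range(int(low), int(high or low) + 1):
            names.append(f"{prefix}{index}")
    return names


# VRAM per node range, copied from the gres.conf lines in this message.
VRAM_BY_RANGE = {
    "node[1,7,9]": "24G",
    "node[2-6]": "12G",
    "node[8,10]": "16G",
    "node[11-14]": "48G",
}


def nodes_with_vram(request):
    """Return the nodes whose configured VRAM count covers the request."""
    needed = parse_count(request)
    eligible = []
    for expr, count in VRAM_BY_RANGE.items():
        if parse_count(count) >= needed:
            eligible.extend(expand_nodes(expr))
    return eligible


print(nodes_with_vram("32G"))
```

With the counts above, a 32G request can only be satisfied by node11 to node14, which is consistent with the job being placed on node12.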

So, as I understand it, the scheduler knows about gres/VRAM, but 
slurmd does not.
Did this behaviour change in 21.08, and how can I get the old behaviour back?

Thanks in advance
Quirin Lohr

Verbose srun output:
> srun: defined options
> srun: -------------------- --------------------
> srun: gpus                : 1
> srun: gres                : gres:VRAM:32G
> srun: verbose             : 1
> srun: -------------------- --------------------
> srun: end of defined options
> srun: Waiting for nodes to boot (delay looping 4650 times @ 0.100000 secs x index)
> srun: Nodes node12 are ready for job
> srun: jobid 571261: nodes(1):`node12', cpu counts: 1(x1)
> srun: error: Unable to create step for job 571261: Invalid generic resource (gres) specification




sacctmgr show tres:
>     Type            Name     ID
> -------- --------------- ------
>      cpu                      1
>      mem                      2
>   energy                      3
>     node                      4
>  billing                      5
>       fs            disk      6
>     vmem                      7
>    pages                      8
>     gres             gpu   1001
>     gres       gpu:p6000   1002
>     gres     gpu:titanxp   1003
>     gres            vram   1004
>     gres gpu:titanxpasc+   1005
>     gres       cudacores   1006
>     gres     gpu:rtx5000   1007
>     gres     gpu:rtx6000   1008
>     gres             mps   1009
>     gres     mps:rtx5000   1010
>     gres     mps:rtx6000   1011
>     gres     gpu:rtx8000   1012
>     gres       gpu:titan   1013
>     gres    gpu:rtx_5000   1014
>     gres    gpu:rtx_6000   1015
>     gres    gpu:rtx_8000   1016
>     gres   gpu:rtx_a6000   1017



-- 
Quirin Lohr
Systemadministration
Technische Universität München
Fakultät für Informatik
Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz

Boltzmannstrasse 3
85748 Garching

Tel. +49 89 289 17769
Fax +49 89 289 17757

quirin.lohr at in.tum.de
www.vision.in.tum.de
