<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto">Hi Quirin, maybe you are hitting this GRES issue:<div><br></div><div><a href="https://bugs.schedmd.com/show_bug.cgi?id=12642#c27">https://bugs.schedmd.com/show_bug.cgi?id=12642#c27</a><br><br><div dir="ltr">--<div>Bas van der Vlies<div><br></div></div></div><div dir="ltr"><br><blockquote type="cite">On 17 Oct 2021, at 16:32, Quirin Lohr &lt;quirin.lohr@in.tum.de&gt; wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><span>Hi,</span><br><span></span><br><span>I just upgraded from 20.11 to 21.08.2.</span><br><span></span><br><span>Now it seems that slurmd cannot handle my custom GRES.</span><br><span>I have defined the VRAM of the GPUs as a custom GRES so that users can select a GPU with enough VRAM for their jobs.</span><br><span></span><br><span>I defined the VRAM in gres.conf:</span><br><span></span><br><blockquote type="cite"><span>NodeName=node[1,7,9] Name=VRAM Count=24G Flags=CountOnly</span><br></blockquote><blockquote type="cite"><span>NodeName=node[2-6] Name=VRAM Count=12G Flags=CountOnly</span><br></blockquote><blockquote type="cite"><span>NodeName=node[8,10] Name=VRAM Count=16G Flags=CountOnly</span><br></blockquote><blockquote type="cite"><span>NodeName=node[11-14] Name=VRAM Count=48G Flags=CountOnly</span><br></blockquote><span></span><br><span></span><br><span></span><br><span>and in slurm.conf:</span><br><blockquote type="cite"><span>AccountingStorageTRES=gres/gpu,gres/gpu:p6000,gres/gpu:titan,gres/VRAM,gres/gpu:rtx_5000,gres/gpu:rtx_6000,gres/gpu:rtx_8000,gres/gpu:rtx_a6000</span><br></blockquote><blockquote type="cite"><span>GresTypes=gpu,VRAM</span><br></blockquote><blockquote type="cite"><span>NodeName=node1  CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=230000  Weight=30 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:4,VRAM:no_consume:24G</span><br></blockquote><blockquote 
type="cite"><span>NodeName=node2  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=20 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:7,VRAM:no_consume:12G</span><br></blockquote><blockquote type="cite"><span>NodeName=node3  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=21 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G</span><br></blockquote><blockquote type="cite"><span>NodeName=node4  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=22 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G</span><br></blockquote><blockquote type="cite"><span>NodeName=node5  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=23 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G</span><br></blockquote><blockquote type="cite"><span>NodeName=node6  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=24 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,titan      Gres=gpu:titan:8,VRAM:no_consume:12G</span><br></blockquote><blockquote type="cite"><span>NodeName=node7  CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=490000  Weight=31 Feature=CPU_GEN:SBEP,CPU_SKU=E5-26,p6000      Gres=gpu:p6000:8,VRAM:no_consume:24G</span><br></blockquote><blockquote type="cite"><span>NodeName=node8  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=40 Feature=CPU_GEN:SKYL,CPU_SKU=GOLD-61,rtx_5000 Gres=gpu:rtx_5000:9,VRAM:no_consume:16G</span><br></blockquote><blockquote type="cite"><span>NodeName=node9  CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=50 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_6000   
Gres=gpu:rtx_6000:9,VRAM:no_consume:24G</span><br></blockquote><blockquote type="cite"><span>NodeName=node10 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 RealMemory=360000  Weight=41 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_5000   Gres=gpu:rtx_5000:9,VRAM:no_consume:16G</span><br></blockquote><blockquote type="cite"><span>NodeName=node11 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=60 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G</span><br></blockquote><blockquote type="cite"><span>NodeName=node12 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=61 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G</span><br></blockquote><blockquote type="cite"><span>NodeName=node13 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=62 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_8000   Gres=gpu:rtx_8000:9,VRAM:no_consume:48G</span><br></blockquote><blockquote type="cite"><span>NodeName=node14 CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=1500000 Weight=63 Feature=CPU_GEN:CL,CPU_SKU=GOLD-62,rtx_a6000  Gres=gpu:rtx_a6000:8,VRAM:no_consume:48G</span><br></blockquote><span></span><br><span></span><br><span>If I run a job specifying only --gpus=1, it gets executed on node2. If I add --gres=VRAM:32G, it gets scheduled to node12 but is then terminated with "Invalid generic resource (gres) specification".</span><br><span></span><br><span>So it appears that the scheduler knows about gres/VRAM, but slurmd does not.</span><br><span>Did this behaviour change in 21.08, and how can I get the old behaviour back?</span><br><span></span><br><span>Thanks in advance</span><br><span>Quirin Lohr</span><br><span></span><br><blockquote type="cite"><span>srun: defined options</span><br></blockquote><blockquote 
type="cite"><span>srun: -------------------- --------------------</span><br></blockquote><blockquote type="cite"><span>srun: gpus                : 1</span><br></blockquote><blockquote type="cite"><span>srun: gres                : gres:VRAM:32G</span><br></blockquote><blockquote type="cite"><span>srun: verbose             : 1</span><br></blockquote><blockquote type="cite"><span>srun: -------------------- --------------------</span><br></blockquote><blockquote type="cite"><span>srun: end of defined options</span><br></blockquote><blockquote type="cite"><span>srun: Waiting for nodes to boot (delay looping 4650 times @ 0.100000 secs x index)</span><br></blockquote><blockquote type="cite"><span>srun: Nodes node12 are ready for job</span><br></blockquote><blockquote type="cite"><span>srun: jobid 571261: nodes(1):`node12', cpu counts: 1(x1)</span><br></blockquote><blockquote type="cite"><span>srun: error: Unable to create step for job 571261: Invalid generic resource (gres) specification</span><br></blockquote><span></span><br><span></span><br><span></span><br><span></span><br><span>sacctmgr show tres:</span><br><blockquote type="cite"><span>    Type            Name     ID</span><br></blockquote><blockquote type="cite"><span>-------- --------------- ------</span><br></blockquote><blockquote type="cite"><span>     cpu                      1</span><br></blockquote><blockquote type="cite"><span>     mem                      2</span><br></blockquote><blockquote type="cite"><span>  energy                      3</span><br></blockquote><blockquote type="cite"><span>    node                      4</span><br></blockquote><blockquote type="cite"><span> billing                      5</span><br></blockquote><blockquote type="cite"><span>      fs            disk      6</span><br></blockquote><blockquote type="cite"><span>    vmem                      7</span><br></blockquote><blockquote type="cite"><span>   pages                      8</span><br></blockquote><blockquote 
type="cite"><span>    gres             gpu   1001</span><br></blockquote><blockquote type="cite"><span>    gres       gpu:p6000   1002</span><br></blockquote><blockquote type="cite"><span>    gres     gpu:titanxp   1003</span><br></blockquote><blockquote type="cite"><span>    gres            vram   1004</span><br></blockquote><blockquote type="cite"><span>    gres gpu:titanxpasc+   1005</span><br></blockquote><blockquote type="cite"><span>    gres       cudacores   1006</span><br></blockquote><blockquote type="cite"><span>    gres     gpu:rtx5000   1007</span><br></blockquote><blockquote type="cite"><span>    gres     gpu:rtx6000   1008</span><br></blockquote><blockquote type="cite"><span>    gres             mps   1009</span><br></blockquote><blockquote type="cite"><span>    gres     mps:rtx5000   1010</span><br></blockquote><blockquote type="cite"><span>    gres     mps:rtx6000   1011</span><br></blockquote><blockquote type="cite"><span>    gres     gpu:rtx8000   1012</span><br></blockquote><blockquote type="cite"><span>    gres       gpu:titan   1013</span><br></blockquote><blockquote type="cite"><span>    gres    gpu:rtx_5000   1014</span><br></blockquote><blockquote type="cite"><span>    gres    gpu:rtx_6000   1015</span><br></blockquote><blockquote type="cite"><span>    gres    gpu:rtx_8000   1016</span><br></blockquote><blockquote type="cite"><span>    gres   gpu:rtx_a6000   1017</span><br></blockquote><span></span><br><span></span><br><span></span><br><span>-- </span><br><span>Quirin Lohr</span><br><span>Systemadministration</span><br><span>Technische Universität München</span><br><span>Fakultät für Informatik</span><br><span>Lehrstuhl für Bildverarbeitung und Künstliche Intelligenz</span><br><span></span><br><span>Boltzmannstrasse 3</span><br><span>85748 Garching</span><br><span></span><br><span>Tel. 
+49 89 289 17769</span><br><span>Fax +49 89 289 17757</span><br><span></span><br><span>quirin.lohr@in.tum.de</span><br><span>www.vision.in.tum.de</span><br><span></span><br></div></blockquote></div></body></html>