<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">Update:</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">If I call the smaller card "Quadro" rather than "RTX5000", it works
correctly.</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">In slurm.conf:</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="font-family: "Courier New", monospace; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof ContentPasted0">NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2
CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:Quadro:1,shard:88 Feature=gpu,ht</span><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof ContentPasted0"><br class="FluidPluginCopy ContentPasted0">
<br>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof">In gres.conf:</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof"><br>
</span></div>
<div class="elementToProof"><span style="color: rgb(0, 0, 0); font-family: "Courier New", monospace; font-size: 12pt;">AutoDetect=nvml</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof ContentPasted1">
<div class="FluidPluginCopy elementToProof"><span style="font-family: "Courier New", monospace;">Name=gpu Type=A5000 File=/dev/nvidia0</span><br>
</div>
<div class="FluidPluginCopy ContentPasted1"><span style="font-family: "Courier New", monospace;">Name=gpu Type=A5000 File=/dev/nvidia1</span></div>
<div class="FluidPluginCopy ContentPasted1"><span style="font-family: "Courier New", monospace;">Name=gpu Type=Quadro File=/dev/nvidia2</span></div>
<div class="FluidPluginCopy ContentPasted1"><span style="font-family: "Courier New", monospace;">Name=gpu Type=A5000 File=/dev/nvidia3</span></div>
<div class="FluidPluginCopy ContentPasted1"><span style="font-family: "Courier New", monospace;">Name=shard Count=24 File=/dev/nvidia0</span></div>
<div class="FluidPluginCopy ContentPasted1"><span style="font-family: "Courier New", monospace;">Name=shard Count=24 File=/dev/nvidia1</span></div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><span style="font-family: "Courier New", monospace;">Name=shard Count=16 File=/dev/nvidia2</span></div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><span style="font-family: "Courier New", monospace;">Name=shard Count=24 File=/dev/nvidia3</span></div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><br>
</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><br>
</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof">Does the name string have to be (part of) what nvidia-smi or NVML reports? </div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><br>
</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof">Cheers,</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><br>
</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof">Esben</div>
<div class="FluidPluginCopy ContentPasted1 elementToProof"><br>
</div>
</span></div>
<div class="elementToProof"><span style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0); background-color: rgb(255, 255, 255);" class="elementToProof"><br>
</span></div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> EPF (Esben Peter Friis) <EPF@novozymes.com><br>
<b>Sent:</b> Thursday, January 5, 2023 16:51<br>
<b>To:</b> slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com><br>
<b>Subject:</b> Sharding not working correctly if several gpu types are defined </font>
<div> </div>
</div>
<style type="text/css" style="display:none">
<!--
p
{margin-top:0;
margin-bottom:0}
-->
</style>
<div dir="ltr">
<div class="x_elementToProof"><span class="x_elementToProof" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">Really great that there is now a way to share GPUs between several jobs
- even with several GPUs per host. Thanks for adding this feature!<br>
<br>
</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">I have compiled (against cuda 11.8) and installed 22.05.7. </span><br>
</div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">The test system is one host with 4 GPUs (3 x NVIDIA A5000 </span><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">+
1 x NVIDIA RTX5000)</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt"><br>
</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">nvidia-smi reports this:</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt"><br>
</span></div>
<div class="x_elementToProof"><span class="x_ContentPasted2" style="color:rgb(0,0,0); font-family:"Courier New",monospace; font-size:9pt">+-----------------------------------------------------------------------------+</span><span class="x_ContentPasted2" style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">|-------------------------------+----------------------+----------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| | | MIG M. |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">|===============================+======================+======================|</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 NVIDIA RTX A5000 On | 00000000:02:00.0 Off | Off |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 42% 62C P2 88W / 230W | 207MiB / 24564MiB | 0% Default |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| | | N/A |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">+-------------------------------+----------------------+----------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 1 NVIDIA RTX A5000 On | 00000000:03:00.0 Off | Off |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 45% 61C P5 80W / 230W | 3MiB / 24564MiB | 0% Default |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| | | N/A |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">+-------------------------------+----------------------+----------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 2 Quadro RTX 5000 On | 00000000:83:00.0 Off | Off |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 51% 63C P0 67W / 230W | 3MiB / 16384MiB | 0% Default |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| | | N/A |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">+-------------------------------+----------------------+----------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 3 NVIDIA RTX A5000 On | 00000000:84:00.0 Off | Off |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| 31% 52C P0 64W / 230W | 3MiB / 24564MiB | 0% Default |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2"><span style="font-family:"Courier New",monospace; font-size:9pt">| | | N/A |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted2 x_elementToProof"><span style="font-family:"Courier New",monospace; font-size:9pt">+-------------------------------+----------------------+----------------------+</span></div>
<br>
</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">My gres.conf is this. The RTX5000 has less memory, so we configure it with fewer shards:</span><br>
</div>
<div class="x_elementToProof"><span class="x_ContentPasted0 x_elementToProof" style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">
<div class="x_FluidPluginCopy x_ContentPasted0 x_elementToProof"><br>
</div>
<div class="x_FluidPluginCopy x_ContentPasted0 x_elementToProof"><span style="font-family:"Courier New",monospace">AutoDetect=nvml</span><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof"><span style="font-family:"Courier New",monospace">Name=gpu Type=A5000 File=/dev/nvidia0</span><br>
</div>
<div class="x_FluidPluginCopy x_ContentPasted0 x_elementToProof"><span style="font-family:"Courier New",monospace">Name=gpu Type=A5000 File=/dev/nvidia1</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0 x_elementToProof"><span style="font-family:"Courier New",monospace">Name=gpu Type=RTX5000 File=/dev/nvidia2</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0"><span style="font-family:"Courier New",monospace">Name=gpu Type=A5000 File=/dev/nvidia3</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0"><span style="font-family:"Courier New",monospace">Name=shard Count=24 File=/dev/nvidia0</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0 x_elementToProof"><span style="font-family:"Courier New",monospace">Name=shard Count=24 File=/dev/nvidia1</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0"><span style="font-family:"Courier New",monospace">Name=shard Count=16 File=/dev/nvidia2</span></div>
<div class="x_FluidPluginCopy x_ContentPasted0"><span style="font-family:"Courier New",monospace">Name=shard Count=24 File=/dev/nvidia3</span></div>
</span></div>
<div class="x_elementToProof"><span style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt"><br>
</span></div>
<div class="x_elementToProof" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
<br>
</div>
<div class="x_elementToProof"><span class="x_ContentPasted1 x_elementToProof" style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">If I don't configure the GPUs by type, like this in slurm.conf:
<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace">NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:4,shard:88 Feature=gpu,ht</span></div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
<div class="x_FluidPluginCopy x_ContentPasted1 x_elementToProof">and run 7 jobs, each requesting 12 shards, it works exactly as expected: 2 jobs on each of the A5000s and one job on the RTX5000. (Subsequent jobs requesting 12 shards are correctly queued.)
</div>
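<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
<div class="x_FluidPluginCopy x_ContentPasted1">The expected placement is just integer division of each GPU's shard count by the request size. A minimal sketch of that arithmetic in Python (my own illustration with a hypothetical helper, not anything Slurm runs), assuming all shards of a job have to come from a single GPU:</div>
<pre style="font-family:'Courier New',monospace; font-size:9pt">
# Shard counts per device, copied from the gres.conf above.
SHARDS_PER_GPU = {
    "/dev/nvidia0": 24,  # A5000
    "/dev/nvidia1": 24,  # A5000
    "/dev/nvidia2": 16,  # RTX5000
    "/dev/nvidia3": 24,  # A5000
}


def jobs_per_gpu(shards_per_gpu, shards_per_job):
    """Hypothetical helper: whole jobs that fit on each device.

    Assumption: a job's shards are never split across GPUs.
    """
    return {dev: count // shards_per_job
            for dev, count in shards_per_gpu.items()}


expected = jobs_per_gpu(SHARDS_PER_GPU, 12)
print(expected)
print("total shards:", sum(SHARDS_PER_GPU.values()))
print("total jobs:", sum(expected.values()))
</pre>
<div class="x_FluidPluginCopy x_ContentPasted1">Under that assumption, the shard total matches the shard:88 in slurm.conf, and 7 jobs of 12 shards fill the node with two jobs per A5000 and one on the RTX5000.</div>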
<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">+-----------------------------------------------------------------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| Processes: |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| GPU GI CI PID Type Process name GPU Memory |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| ID ID Usage |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">|=============================================================================|</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 1160663 C ...-2020-ubuntu20.04/bin/gmx 260MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 1160758 C ...-2020-ubuntu20.04/bin/gmx 254MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 1 N/A N/A 1160643 C ...-2020-ubuntu20.04/bin/gmx 262MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 1 N/A N/A 1160647 C ...-2020-ubuntu20.04/bin/gmx 256MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 2 N/A N/A 1160659 C ...-2020-ubuntu20.04/bin/gmx 174MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 3 N/A N/A 1160644 C ...-2020-ubuntu20.04/bin/gmx 248MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">| 3 N/A N/A 1160755 C ...-2020-ubuntu20.04/bin/gmx 260MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted1"><span style="font-family:"Courier New",monospace; font-size:9pt">+-----------------------------------------------------------------------------+</span></div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted1">
</div>
That's great! <br>
If we run jobs requiring one or more full GPUs, we would like to be able to request specific GPU types as well. </span></div>
<div class="x_elementToProof"><span class="x_ContentPasted1 x_elementToProof" style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt"><br>
</span></div>
<div class="x_elementToProof"><span class="x_ContentPasted1 x_elementToProof x_ContentPasted3" style="color:rgb(0,0,0); font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt">But if I configure the gpus also by name like this in slurm.conf:
<div class="x_FluidPluginCopy"><br class="x_ContentPasted3">
</div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace">NodeName=koala NodeAddr=10.194.132.190 Boards=1 SocketsPerBoard=2 CoresPerSocket=14 ThreadsPerCore=2 Gres=gpu:A5000:3,gpu:RTX5000:1,shard:88 Feature=gpu,ht</span></div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted3">
</div>
<div class="x_FluidPluginCopy x_ContentPasted3 x_elementToProof">and run 7 jobs, each requesting 12 shards, it does NOT work. It starts two jobs on each of the first two A5000s, two jobs on the RTX5000, and one job on the last A5000. Strangely, it still knows that it should
not start more jobs: subsequent jobs are still queued. </div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted3">
</div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">+-----------------------------------------------------------------------------+</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| Processes: |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| GPU GI CI PID Type Process name GPU Memory |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| ID ID Usage |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">|=============================================================================|</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 904552 C ...ing_proj/.venv/bin/python 204MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 1176564 C ...-2020-ubuntu20.04/bin/gmx 258MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 0 N/A N/A 1176565 C ...-2020-ubuntu20.04/bin/gmx 258MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 1 N/A N/A 1176562 C ...-2020-ubuntu20.04/bin/gmx 258MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 1 N/A N/A 1176566 C ...-2020-ubuntu20.04/bin/gmx 258MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 2 N/A N/A 1176560 C ...-2020-ubuntu20.04/bin/gmx 172MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 2 N/A N/A 1176561 C ...-2020-ubuntu20.04/bin/gmx 172MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">| 3 N/A N/A 1176563 C ...-2020-ubuntu20.04/bin/gmx 258MiB |</span></div>
<div class="x_FluidPluginCopy x_ContentPasted3"><span style="font-family:"Courier New",monospace; font-size:9pt">+-----------------------------------------------------------------------------+</span></div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted3">
</div>
<div class="x_FluidPluginCopy x_elementToProof"><br class="x_ContentPasted3">
</div>
<div class="x_FluidPluginCopy x_elementToProof">It is also strange that "scontrol show node" seems to list the shards correctly, even in this case:</div>
<div class="x_FluidPluginCopy x_elementToProof"><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt">NodeName=koala Arch=x86_64 CoresPerSocket=14
</span>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> CPUAlloc=0 CPUEfctv=56 CPUTot=56 CPULoad=22.16</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> AvailableFeatures=gpu,ht</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> ActiveFeatures=gpu,ht</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> Gres=gpu:A5000:3(S:0-1),gpu:RTX5000:1(S:0-1),shard:A5000:72(S:0-1),shard:RTX5000:16(S:0-1)</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> NodeAddr=10.194.132.190 NodeHostName=koala Version=22.05.7</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> OS=Linux 5.4.0-135-generic #152-Ubuntu SMP Wed Nov 23 20:19:22 UTC 2022
</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> RealMemory=1 AllocMem=0 FreeMem=390036 Sockets=2 Boards=1</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> Partitions=urgent,high,medium,low
</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> BootTime=2023-01-03T12:37:17 SlurmdStartTime=2023-01-05T16:24:53</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> LastBusyTime=2023-01-05T16:37:24</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> CfgTRES=cpu=56,mem=1M,billing=56,gres/gpu=4,gres/shard=88</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> AllocTRES=</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> CapWatts=n/a</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> CurrentWatts=0 AveWatts=0</span></div>
<div class="x_FluidPluginCopy x_ContentPasted5"><span style="font-family:"Courier New",monospace; font-size:9pt"> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s</span></div>
<div class="x_FluidPluginCopy"><br class="x_ContentPasted5">
</div>
In all cases, my jobs are submitted with commands like this:</div>
<div class="x_FluidPluginCopy x_elementToProof x_ContentPasted5"><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof x_ContentPasted5 x_ContentPasted6">
<span style="font-family:"Courier New",monospace">sbatch --gres=shard:12 --wrap 'bash -c " ... (command goes here) ... "'</span><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof x_ContentPasted5"><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof x_ContentPasted5"><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof"><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof"><span class="x_FluidPluginCopy x_ContentPasted3 x_ContentPasted4" style="margin:0px">The behavior is very consistent. I have played around with adding CUDA_DEVICE_ORDER=PCI_BUS_ID to the environment of slurmd
and slurmctld, but it makes no difference.</span><br class="x_FluidPluginCopy x_ContentPasted4">
<span class="x_FluidPluginCopy" style="margin:0px"></span><br>
</div>
<div class="x_FluidPluginCopy x_elementToProof">Is this a bug or a feature?<br>
</div>
<br>
</span></div>
<div class="x_elementToProof"><span class="x_elementToProof" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">Cheers,</span></div>
<div class="x_elementToProof"><span class="x_elementToProof" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)"><br>
</span></div>
<div class="x_elementToProof"><span class="x_elementToProof" style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">Esben</span></div>
</div>
</body>
</html>