<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX.<br>
There is a way to make CUDA applications number the GPUs in the same PCI bus order that nvidia-smi uses, though, by setting <span style="color:rgb(51, 51, 51);font-family:monospace, monospace;font-size:13.6px;background-color:rgba(27, 31, 35, 0.05);display:inline !important">
CUDA_DEVICE_ORDER=PCI_BUS_ID</span>. </div>
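<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
As a minimal sketch (assuming a CUDA application started from a shell or job script), setting the variable to PCI_BUS_ID asks the CUDA runtime to enumerate GPUs in PCI bus order, which is the order nvidia-smi lists them in, instead of the default FASTEST_FIRST. The nvidia-smi query in the comment is only a suggestion for comparing index and bus ID on the node itself.</div>
<pre>
```shell
# Ask the CUDA runtime to enumerate GPUs by PCI bus ID rather than
# by the default FASTEST_FIRST heuristic.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
echo "CUDA_DEVICE_ORDER=${CUDA_DEVICE_ORDER}"

# On the GPU node, list index and PCI bus ID side by side (not run here):
# nvidia-smi --query-gpu=index,pci.bus_id,name --format=csv
```
</pre>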
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
See <a href="https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/" id="LPlnk240533">https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/</a></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Cheers,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Esben</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> slurm-users &lt;slurm-users-bounces@lists.schedmd.com&gt; on behalf of Timony, Mick &lt;Michael_Timony@hms.harvard.edu&gt;<br>
<b>Sent:</b> Monday, January 31, 2022 15:45<br>
<b>To:</b> slurm-users@lists.schedmd.com &lt;slurm-users@lists.schedmd.com&gt;<br>
<b>Subject:</b> Re: [slurm-users] How to tell SLURM to ignore specific GPUs</font>
<div> </div>
</div>
<style type="text/css" style="display:none">
<!--
p
        {margin-top:0;
        margin-bottom:0}
-->
</style>
<div dir="ltr">
<blockquote itemscope="" itemtype="https://schemas.microsoft.com/QuotedText" style="border-left:3px solid rgb(200,200,200); border-top-color:rgb(200,200,200); border-right-color:rgb(200,200,200); border-bottom-color:rgb(200,200,200); padding-left:1ex; margin-left:0.8ex">
<div style="font-family:Arial,Helvetica,sans-serif; font-size:10pt; color:rgb(0,0,0); background-color:rgb(255,255,255)">
I have a large compute node with 10 RTX8000 cards at a remote colo.<br>
</div>
<div class="x_BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="x_PlainText">One of the cards on it is acting up, "falling off the bus" once a day and<br>
requiring a full power cycle to reset.<br>
<br>
I want jobs to avoid that card as well as the card it is NVLINK'ed to.<br>
<br>
So I modified gres.conf on that node as follows:<br>
<br>
<br>
# cat /etc/slurm/gres.conf<br>
AutoDetect=nvml<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1<br>
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3<br>
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8<br>
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9<br>
<br>
and in slurm.conf I changed the node definition from Gres=gpu:quadro_rtx_8000:10<br>
to Gres=gpu:quadro_rtx_8000:8.  I restarted slurmctld and slurmd<br>
after this.<br>
<br>
I then put the node back from drain to idle.  Jobs were submitted and <br>
started on the node, but they are using the GPUs I told it to avoid:<br>
<br>
+--------------------------------------------------------------------+<br>
| Processes:                                                         |<br>
|  GPU   GI   CI        PID   Type   Process name         GPU Memory |<br>
|        ID   ID                                          Usage      |<br>
|====================================================================|<br>
|    0   N/A  N/A     63426      C   python                 11293MiB |<br>
|    1   N/A  N/A     63425      C   python                 11293MiB |<br>
|    2   N/A  N/A     63425      C   python                 10869MiB |<br>
|    2   N/A  N/A     63426      C   python                 10869MiB |<br>
|    4   N/A  N/A     63425      C   python                 10849MiB |<br>
|    4   N/A  N/A     63426      C   python                 10849MiB |<br>
+--------------------------------------------------------------------+<br>
<br>
How can I make SLURM not use GPU 2 and 4?<br>
<br>
---------------------------------------------------------------<br>
Paul Raines                     <a href="http://help.nmr.mgh.harvard.edu/">http://help.nmr.mgh.harvard.edu</a><br>
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging<br>
149 (2301) 13th Street     Charlestown, MA 02129            USA<br>
<br>
</div>
</span></font></div>
</blockquote>
<div class="x_BodyFragment"><font size="2"><span style="font-size:11pt">
<div class="x_PlainText"><br>
You can use the nvidia-smi command to 'drain' the GPUs, which will power down the GPUs so that no applications will use them.<br>
<br>
This answer on Unix &amp; Linux Stack Exchange explains how to do that:<br>
<br>
<a href="https://unix.stackexchange.com/a/654089/94412" id="LPlnk217751">https://unix.stackexchange.com/a/654089/94412</a><br>
<br>
You can create a script to run at boot and 'drain' the cards.</div>
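<div class="x_PlainText"><br>
A minimal sketch of such a boot script, under stated assumptions: the PCI bus ID in BAD_GPUS is a hypothetical placeholder (the real IDs of the two cards are not given in this thread), the run and drain_gpu helpers are names made up for illustration, and persistence mode is switched off first on the assumption that the drain may be refused otherwise. It defaults to a dry run that only prints the commands; actually draining needs root.</div>
<pre>
```shell
#!/bin/sh
# Hypothetical placeholder: replace with the PCI bus IDs of the cards to
# take out of service (see: nvidia-smi --query-gpu=index,pci.bus_id --format=csv).
BAD_GPUS="${BAD_GPUS:-0000:XX:00.0}"

# Dry run by default, so the script only prints what it would do.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

drain_gpu() {
    # Assumption: persistence mode is disabled before draining.
    run nvidia-smi -i "$1" -pm 0
    # -m 1 puts the device into drain state; -m 0 would undo it.
    run nvidia-smi drain -p "$1" -m 1
}

for id in $BAD_GPUS; do
    drain_gpu "$id"
done
```
</pre>
<div class="x_PlainText">Set DRY_RUN=0 and BAD_GPUS to the real bus IDs (space separated) once the printed commands look right.</div>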
<div class="x_PlainText"><br>
</div>
<div class="x_PlainText">Regards</div>
<div class="x_PlainText">--Mick<br>
<br>
<br>
</div>
</span></font></div>
</div>
</body>
</html>