[slurm-users] How to tell SLURM to ignore specific GPUs

EPF (Esben Peter Friis) EPF at novozymes.com
Tue Feb 1 08:09:34 UTC 2022

The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX.
There is a way to force that, though, using CUDA_DEVICE_ORDER.

See https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
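As a rough sketch of how one might check the mismatch described above: on Linux the NVIDIA driver exposes one directory per GPU under /proc/driver/nvidia/gpus, named by PCI bus ID, and its "information" file carries a "Device Minor" line that corresponds to the N in /dev/nvidiaN. The helper name list_gpu_minors and its optional directory argument are my own, added only so it can be tried against a test directory. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID is the usual way to ask CUDA applications to enumerate by PCI bus ID; as far as I know it does not renumber the /dev/nvidiaN files themselves.

```shell
# Print "/dev/nvidiaN <PCI bus id>" for every GPU the NVIDIA driver reports.
# The optional argument (default /proc/driver/nvidia/gpus) is a hypothetical
# convenience so the function can be pointed at a test directory.
list_gpu_minors() {
    root="${1:-/proc/driver/nvidia/gpus}"
    for info in "$root"/*/information; do
        # Skip silently when no driver (or no GPU) is present.
        [ -r "$info" ] || continue
        bus_id=$(basename "$(dirname "$info")")
        # The line looks like "Device Minor:    3"; take the last field.
        minor=$(awk '/^Device Minor:/ {print $NF}' "$info")
        echo "/dev/nvidia${minor} ${bus_id}"
    done
}

list_gpu_minors

# Ask CUDA applications to enumerate devices by PCI bus ID.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```

Comparing this list with the output of "nvidia-smi --query-gpu=index,pci.bus_id --format=csv" should show which /dev/nvidiaN belongs to which nvidia-smi index.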


From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Timony, Mick <Michael_Timony at hms.harvard.edu>
Sent: Monday, January 31, 2022 15:45
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] How to tell SLURM to ignore specific GPUs

I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day
and requiring a full power cycle to reset.

I want jobs to avoid that card as well as the card it is NVLINK'ed to.

So I modified gres.conf on that node as follows:

# cat /etc/slurm/gres.conf
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9

and in slurm.conf I changed the node definition from
Gres=gpu:quadro_rtx_8000:10 to Gres=gpu:quadro_rtx_8000:8.  I restarted
slurmctld and slurmd after this.
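A small sanity check for the change described above, as a sketch only: it counts the uncommented Name=gpu lines in gres.conf so the result can be compared by eye with the Gres= count in slurm.conf. The function name count_gres_gpus and the GRES_CONF variable are my own, and the count assumes one File= device per line, as in the listing above (a line using a range such as /dev/nvidia[0-3] would be counted once).

```shell
# Count the active (uncommented) GPU lines in a gres.conf-style file.
# Assumes one device per line, as in the gres.conf shown in this post.
count_gres_gpus() {
    grep -c '^[[:space:]]*Name=gpu' "$1"
}

# Hypothetical override variable; the default path is the one from the post.
GRES_CONF="${GRES_CONF:-/etc/slurm/gres.conf}"
if [ -r "$GRES_CONF" ]; then
    echo "active GPU lines in $GRES_CONF: $(count_gres_gpus "$GRES_CONF")"
fi

# On the node itself, "slurmd -G" prints the GRES configuration slurmd
# actually derived, which may be more telling than the raw file:
#   slurmd -G
```

For the listing above this should report 8, matching Gres=gpu:quadro_rtx_8000:8.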

I then put the node back from drain to idle.  Jobs were submitted and
started on the node, but they are using the GPUs I told it to avoid:

| Processes:                                                         |
|  GPU   GI   CI        PID   Type   Process name         GPU Memory |
|        ID   ID                                          Usage      |
|    0   N/A  N/A     63426      C   python                 11293MiB |
|    1   N/A  N/A     63425      C   python                 11293MiB |
|    2   N/A  N/A     63425      C   python                 10869MiB |
|    2   N/A  N/A     63426      C   python                 10869MiB |
|    4   N/A  N/A     63425      C   python                 10849MiB |
|    4   N/A  N/A     63426      C   python                 10849MiB |

How can I make SLURM not use GPU 2 and 4?

http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street     Charlestown, MA 02129            USA

You can use the nvidia-smi command to 'drain' the GPUs, which will power down the GPUs so that no applications will use them.

This thread on Stack Overflow explains how to do that:


You can create a script to run at boot and 'drain' the cards.
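A minimal sketch of what such a boot script might look like. The PCI bus IDs below are placeholders, not the real addresses of the cards in question, and the BAD_GPUS / DRY_RUN variables and the run / drain_gpus helpers are my own naming. It follows the commonly cited recipe of turning persistence mode off for the device before setting the drain state; by default it only prints the commands, so nothing is changed until DRY_RUN=0 is set.

```shell
# Placeholder PCI bus IDs; replace them with the addresses of the two cards
# (nvidia-smi --query-gpu=index,pci.bus_id --format=csv lists them).
BAD_GPUS="${BAD_GPUS:-0000:3b:00.0 0000:5e:00.0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Either print a command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drain every GPU in the space-separated list of PCI bus IDs given as $1.
drain_gpus() {
    for bus_id in $1; do
        # Assumption: persistence mode is switched off before draining.
        run nvidia-smi -i "$bus_id" -pm 0
        # -m 1 puts the device into the drain state.
        run nvidia-smi drain -p "$bus_id" -m 1
    done
}

drain_gpus "$BAD_GPUS"
```

Run as root with DRY_RUN=0 (for example from a systemd unit or rc.local) once the printed commands look right; "nvidia-smi drain -q -p <bus id>" can be used afterwards to query the drain state.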


