[slurm-users] How to tell SLURM to ignore specific GPUs

Stephan Roth stephan.roth at ee.ethz.ch
Mon Jan 31 20:54:59 UTC 2022

Not a solution, but some ideas & experiences concerning the same topic:

A few of our older GPUs used to show the error message "has fallen off 
the bus" which was only resolved by a full power cycle as well.

Something changed: nowadays the error message is "GPU lost" and a 
normal reboot resolves the problem. This might be a result of an update 
of the Nvidia drivers (currently 60.73.01), but I can't be sure.

The current behaviour allowed us to write a script that checks GPU state 
every 10 minutes and sets a node to drain & reboot when such a 
"lost GPU" is detected.
This has been working well for a couple of months now and saves us time.
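In case it's useful, here is a minimal sketch of such a check. The exact
way a lost GPU shows up in nvidia-smi, and whether the node is allowed to
issue "scontrol reboot" itself, are assumptions you'd have to adapt to
your own site:

#!/bin/bash
# Run from cron every 10 minutes on each GPU node. If nvidia-smi can no
# longer talk to all GPUs (e.g. "GPU lost" / "has fallen off the bus"),
# drain the node and schedule a reboot once the running jobs finish.
# Assumes RebootProgram is configured and this host may call
# "scontrol reboot"; adjust detection and policy to your setup.

NODE=$(hostname -s)

if ! OUT=$(nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader 2>&1); then
    logger -t gpu-check "GPU problem on ${NODE}: ${OUT}"
    scontrol reboot ASAP nextstate=RESUME reason="lost GPU" "${NODE}"
fi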

Re-seating all GPUs and PCI risers might help as well; it seemed to in 
one of our GPU nodes. Again, I can't be sure, we'd need to try this 
with other - still failing - GPUs.

The problem is identifying the cards physically from the information we 
have, like what's reported by nvidia-smi or otherwise available from the 
driver.
The serial number isn't shown for every type of GPU and I'm not sure the 
ones shown match the stickers on the GPUs.
If anybody knows of a practical solution for this, I'd be happy 
to read it.
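For what it's worth, querying serial, UUID and PCI bus ID together at 
least ties a reported GPU index to a physical slot, even when the serial 
doesn't match the sticker:

nvidia-smi --query-gpu=index,name,serial,uuid,pci.bus_id --format=csv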

Eventually I'd like to pull out all cards which repeatedly get "lost" 
and maybe move them all to a node for short debug jobs or throw them 
away (they're all beyond warranty anyway).


On 31.01.22 15:45, Timony, Mick wrote:
>     I have a large compute node with 10 RTX8000 cards at a remote colo.
>     One of the cards on it is acting up "falling of the bus" once a day
>     requiring a full power cycle to reset.
>     I want jobs to avoid that card as well as the card it is NVLINK'ed to.
>     So I modified gres.conf on that node as follows:
>     # cat /etc/slurm/gres.conf
>     AutoDetect=nvml
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
>     #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
>     #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
>     Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9
>     and in slurm.conf I changed the node def from Gres=gpu:quadro_rtx_8000:10
>     to Gres=gpu:quadro_rtx_8000:8.  I restarted slurmctld and slurmd
>     after this.
>     I then put the node back from drain to idle.  Jobs were submitted and
>     started on the node, but they are using the GPUs I told it to avoid:
>     +--------------------------------------------------------------------+
>     | Processes:                                                         |
>     |  GPU   GI   CI        PID   Type   Process name         GPU Memory |
>     |        ID   ID                                          Usage      |
>     |====================================================================|
>     |    0   N/A  N/A     63426      C   python                 11293MiB |
>     |    1   N/A  N/A     63425      C   python                 11293MiB |
>     |    2   N/A  N/A     63425      C   python                 10869MiB |
>     |    2   N/A  N/A     63426      C   python                 10869MiB |
>     |    4   N/A  N/A     63425      C   python                 10849MiB |
>     |    4   N/A  N/A     63426      C   python                 10849MiB |
>     +--------------------------------------------------------------------+
>     How can I make SLURM not use GPU 2 and 4?
>     ---------------------------------------------------------------
>     Paul Raines     http://help.nmr.mgh.harvard.edu
>     MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
>     149 (2301) 13th Street     Charlestown, MA 02129            USA
> You can use the nvidia-smi command to 'drain' the GPUs, which powers 
> them down so that no applications will use them.
> This answer on Stack Exchange explains how to do that:
> https://unix.stackexchange.com/a/654089/94412
> You can create a script that runs at boot to 'drain' the cards.
> Regards
> --Mick
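
For reference, a boot-time script along the lines of that answer might 
look roughly like this. The PCI bus IDs below are placeholders for the 
failing card and its NVLink partner; look up the real ones before use:

#!/bin/bash
# Run once at boot (systemd unit or rc.local): put the two bad GPUs into
# drain mode so the driver powers them down and applications can't use them.
# Placeholder bus IDs; get the real ones from:
#   nvidia-smi --query-gpu=index,pci.bus_id --format=csv
nvidia-smi drain -p 0000:3B:00.0 -m 1
nvidia-smi drain -p 0000:5E:00.0 -m 1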