[slurm-users] How to tell SLURM to ignore specific GPUs
Stephan Roth
stephan.roth at ee.ethz.ch
Mon Jan 31 20:54:59 UTC 2022
Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off
the bus", which was likewise only resolved by a full power cycle.
Something changed: nowadays the error message is "GPU lost", and a
normal reboot resolves the problem. This might be a result of an update
of the Nvidia drivers (currently 60.73.01), but I can't be sure.
The current behaviour allowed us to write a script that checks GPU
state every 10 minutes and sets the node to a drain & reboot state when
such a "lost GPU" is detected.
This has been working well for a couple of months now and saves us time.
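A minimal sketch of such a check, run from cron on each GPU node (the
exact error strings matched and the use of "scontrol reboot ASAP" are
assumptions; adapt them to your drivers and Slurm version):

#!/bin/bash
# Runs every 10 minutes from cron. If nvidia-smi reports a lost or
# fallen-off GPU, drain the node and ask Slurm to reboot it once the
# running jobs have finished.
if nvidia-smi 2>&1 | grep -qiE "GPU is lost|fallen off the bus"; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="GPU lost"
    scontrol reboot ASAP "$(hostname -s)"
fi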
Re-seating all GPUs and PCIe risers might help as well; it seemed to
help in one of our GPU nodes. Again, I can't be sure; we'd need to try
this with other, still failing, GPUs.
The problem is identifying the cards physically from the information we
have, such as what nvidia-smi reports or what's available in
/proc/driver/nvidia/gpus/*/information
The serial number isn't shown for every type of GPU, and I'm not sure
the ones shown match the stickers on the cards.
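For what it's worth, nvidia-smi can print the index, PCI bus ID, serial
number and UUID per card, though on some boards the serial comes back
as [N/A]:

nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid --format=csv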
If anybody knows of a practical solution for this, I'd be happy to
read it.
Eventually I'd like to pull out all cards which repeatedly get "lost"
and maybe move them all to a node for short debug jobs or throw them
away (they're all beyond warranty anyway).
Stephan
On 31.01.22 15:45, Timony, Mick wrote:
> I have a large compute node with 10 RTX8000 cards at a remote colo.
> One of the cards on it is acting up, "falling off the bus" once a day
> and requiring a full power cycle to reset.
>
> I want jobs to avoid that card as well as the card it is NVLINK'ed to.
>
> So I modified gres.conf on that node as follows:
>
>
> # cat /etc/slurm/gres.conf
> AutoDetect=nvml
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
> #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
> #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9
>
> and in slurm.conf I changed the node definition from
> Gres=gpu:quadro_rtx_8000:10 to Gres=gpu:quadro_rtx_8000:8. I
> restarted slurmctld and slurmd after this.
>
> I then put the node back from drain to idle. Jobs were submitted and
> started on the node, but they are using the GPUs I told it to avoid:
>
> +--------------------------------------------------------------------+
> | Processes: |
> | GPU GI CI PID Type Process name GPU Memory |
> | ID ID Usage |
> |====================================================================|
> | 0 N/A N/A 63426 C python 11293MiB |
> | 1 N/A N/A 63425 C python 11293MiB |
> | 2 N/A N/A 63425 C python 10869MiB |
> | 2 N/A N/A 63426 C python 10869MiB |
> | 4 N/A N/A 63425 C python 10849MiB |
> | 4 N/A N/A 63426 C python 10849MiB |
> +--------------------------------------------------------------------+
>
> How can I make SLURM not use GPU 2 and 4?
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA
>
>
> You can use the nvidia-smi command to 'drain' the GPUs, which will
> power down the GPUs so that no applications will use them.
>
> This answer on the Unix & Linux Stack Exchange explains how to do that:
>
> https://unix.stackexchange.com/a/654089/94412
>
> You can create a script to run at boot and 'drain' the cards.
>
> Regards
> --Mick
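For reference, the 'drain' approach from the linked answer boils down
to nvidia-smi's drain subcommand. A minimal sketch, assuming you first
look up the PCI bus ID of the failing card (the bus ID below is a
placeholder; both commands need root):

# find the PCI bus ID of the failing card
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# mark it drained so no new clients can open it
nvidia-smi drain -p 0000:XX:00.0 -m 1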