[slurm-users] How to tell SLURM to ignore specific GPUs
Stephan Roth
stephan.roth at ee.ethz.ch
Mon Jan 31 20:54:59 UTC 2022
Not a solution, but some ideas & experiences concerning the same topic:
A few of our older GPUs used to show the error message "has fallen off
the bus", which was likewise only resolved by a full power cycle.
Something changed: nowadays the error message is "GPU lost", and a
normal reboot resolves the problem. This might be a result of an update
of the Nvidia drivers (currently 60.73.01), but I can't be sure.
The current behaviour allowed us to write a script that checks GPU
state every 10 minutes and sets the node to a drain & reboot state when
such a "lost GPU" is detected.
This has been working well for a couple of months now and saves us time.
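A minimal sketch of such a check, run from cron on each GPU node (the
exact error strings matched and the use of "scontrol reboot ASAP" are
assumptions; adapt them to your drivers and Slurm version):

#!/bin/bash
# Runs every 10 minutes from cron. If nvidia-smi reports a lost or
# fallen-off GPU, drain the node and ask Slurm to reboot it once the
# running jobs have finished.
if nvidia-smi 2>&1 | grep -qiE "GPU is lost|fallen off the bus"; then
    scontrol update NodeName="$(hostname -s)" State=DRAIN Reason="GPU lost"
    scontrol reboot ASAP "$(hostname -s)"
fi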
Re-seating all GPUs and PCIe risers might help as well; it seemed to
help in one of our GPU nodes. Again, I can't be sure; we'd need to try
this with other, still failing, GPUs.
The problem is identifying the cards physically from the information we
have, such as what nvidia-smi reports or what's available in
/proc/driver/nvidia/gpus/*/information
The serial number isn't shown for every type of GPU, and I'm not sure
the ones shown match the stickers on the cards.
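For what it's worth, nvidia-smi can print the index, PCI bus ID, serial
number and UUID per card, though on some boards the serial comes back
as [N/A]:

nvidia-smi --query-gpu=index,pci.bus_id,serial,uuid --format=csv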
If anybody knows of a practical solution for this, I'd be happy to
read it.
Eventually I'd like to pull out all cards which repeatedly get "lost"
and maybe move them all to a node for short debug jobs or throw them
away (they're all beyond warranty anyway).
Stephan
On 31.01.22 15:45, Timony, Mick wrote:
> I have a large compute node with 10 RTX8000 cards at a remote colo.
> One of the cards on it is acting up, "falling off the bus" once a day
> and requiring a full power cycle to reset.
>
> I want jobs to avoid that card as well as the card it is NVLINK'ed to.
>
> So I modified gres.conf on that node as follows:
>
>
> # cat /etc/slurm/gres.conf
> AutoDetect=nvml
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
> #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
> #Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
> Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9
>
> and in slurm.conf I changed the node definition from
> Gres=gpu:quadro_rtx_8000:10 to Gres=gpu:quadro_rtx_8000:8. I
> restarted slurmctld and slurmd after this.
>
> I then put the node back from drain to idle. Jobs were submitted and
> started on the node, but they are using the GPUs I told it to avoid:
>
> +--------------------------------------------------------------------+
> | Processes: |
> | GPU GI CI PID Type Process name GPU Memory |
> | ID ID Usage |
> |====================================================================|
> | 0 N/A N/A 63426 C python 11293MiB |
> | 1 N/A N/A 63425 C python 11293MiB |
> | 2 N/A N/A 63425 C python 10869MiB |
> | 2 N/A N/A 63426 C python 10869MiB |
> | 4 N/A N/A 63425 C python 10849MiB |
> | 4 N/A N/A 63426 C python 10849MiB |
> +--------------------------------------------------------------------+
>
> How can I make SLURM not use GPU 2 and 4?
>
> ---------------------------------------------------------------
> Paul Raines http://help.nmr.mgh.harvard.edu
> MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
> 149 (2301) 13th Street Charlestown, MA 02129 USA
>
>
> You can use the nvidia-smi command to 'drain' the GPUs, which will
> power down the GPUs so that no applications will use them.
>
> This answer on the Unix & Linux Stack Exchange explains how to do that:
>
> https://unix.stackexchange.com/a/654089/94412
>
> You can create a script to run at boot and 'drain' the cards.
>
> Regards
> --Mick
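For reference, the 'drain' approach from the linked answer boils down
to nvidia-smi's drain subcommand. A minimal sketch, assuming you first
look up the PCI bus ID of the failing card (the bus ID below is a
placeholder; both commands need root):

# find the PCI bus ID of the failing card
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# mark it drained so no new clients can open it
nvidia-smi drain -p 0000:XX:00.0 -m 1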