[slurm-users] How to tell SLURM to ignore specific GPUs
EPF (Esben Peter Friis)
EPF at novozymes.com
Tue Feb 1 08:09:34 UTC 2022
The numbering seen from nvidia-smi is not necessarily the same as the order of /dev/nvidiaXX.
There is a way to force them to match, though, using CUDA_DEVICE_ORDER.
See https://shawnliu.me/post/nvidia-gpu-id-enumeration-in-linux/
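As a quick check (a minimal sketch, not specific to this node), you can compare nvidia-smi's numbering with the minor numbers behind /dev/nvidiaXX, and set CUDA_DEVICE_ORDER so applications enumerate GPUs in PCI bus order:

# nvidia-smi index next to each card's PCI bus ID
nvidia-smi --query-gpu=index,pci.bus_id --format=csv
# "Device Minor" here is the XX in /dev/nvidiaXX
grep "Device Minor" /proc/driver/nvidia/gpus/*/information
# make CUDA enumerate devices in PCI bus order rather than "fastest first"
export CUDA_DEVICE_ORDER=PCI_BUS_ID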
Cheers,
Esben
________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Timony, Mick <Michael_Timony at hms.harvard.edu>
Sent: Monday, January 31, 2022 15:45
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] How to tell SLURM to ignore specific GPUs
I have a large compute node with 10 RTX8000 cards at a remote colo.
One of the cards on it is acting up, "falling off the bus" once a day
and requiring a full power cycle to reset.
I want jobs to avoid that card as well as the card it is NVLINK'ed to.
So I modified gres.conf on that node as follows:
# cat /etc/slurm/gres.conf
AutoDetect=nvml
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia0
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia1
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia2
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia3
#Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia4
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia5
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia6
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia7
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia8
Name=gpu Type=quadro_rtx_8000 File=/dev/nvidia9
In slurm.conf I changed the node definition from Gres=gpu:quadro_rtx_8000:10
to Gres=gpu:quadro_rtx_8000:8. I restarted slurmctld and slurmd
after this.
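(As a sketch, that node definition change amounts to something like the following; the node name and hardware figures are invented, and slurmd -G is just one way to confirm which GRES the local slurmd detects after the restart.)

# slurm.conf -- hypothetical node definition, only the Gres= part comes from the text above
NodeName=gpunode01 CPUs=80 RealMemory=768000 Gres=gpu:quadro_rtx_8000:8
# after restarting, print the GRES configuration slurmd actually sees on this node
slurmd -G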
I then put the node back from drain to idle. Jobs were submitted and
started on the node, but they are using the GPUs I told it to avoid:
+--------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|====================================================================|
| 0 N/A N/A 63426 C python 11293MiB |
| 1 N/A N/A 63425 C python 11293MiB |
| 2 N/A N/A 63425 C python 10869MiB |
| 2 N/A N/A 63426 C python 10869MiB |
| 4 N/A N/A 63425 C python 10849MiB |
| 4 N/A N/A 63426 C python 10849MiB |
+--------------------------------------------------------------------+
How can I make SLURM not use GPUs 2 and 4?
---------------------------------------------------------------
Paul Raines http://help.nmr.mgh.harvard.edu
MGH/MIT/HMS Athinoula A. Martinos Center for Biomedical Imaging
149 (2301) 13th Street Charlestown, MA 02129 USA
You can use the nvidia-smi command to 'drain' the GPUs, which will power them down so that no applications will use them.
This answer on Unix Stack Exchange explains how to do that:
https://unix.stackexchange.com/a/654089/94412
You can create a script to run at boot and 'drain' the cards.
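A rough sketch of such a boot-time script, assuming the PCI bus IDs of the bad card and its NVLink partner have been looked up first (the IDs below are placeholders; check nvidia-smi drain --help for the exact option syntax on your driver version):

#!/bin/bash
# Put the faulty card and its NVLink partner into drain state so nothing lands on them.
# Look up the real bus IDs with: nvidia-smi --query-gpu=index,pci.bus_id --format=csv
for busid in 0000:3D:00.0 0000:3E:00.0 ; do
    nvidia-smi drain -p "$busid" -m 1   # requires root
done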
Regards
--Mick