[slurm-users] FW: gres/gpu count lower than reported

Stephan Roth stephan.roth at ee.ethz.ch
Tue May 3 19:36:29 UTC 2022


Hi Jim,

I don't know if it makes a difference, but I only ever use the complete 
numeric suffix within brackets, as in

sjc01enadsapp[01-08]
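Applied to the config lines quoted below, that would mean writing the node ranges as (a hypothetical rewrite, everything else unchanged):

```
NodeName=sjc01enadsapp[01-08] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=sjc01enadsapp[01-08] Name=gpu File=/dev/nvidia[0-3]
```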

Otherwise I'd raise the debug level of slurmd to maximum by setting

SlurmdDebug=debug5

in slurm.conf, tail the SlurmdLogFile on a GPU node and then restart
slurmd there.
This might shed some light on what goes wrong.
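In case it helps, the whole procedure might look like the sketch below. The slurm.conf path and the log location are assumptions (check SlurmdLogFile in `scontrol show config` on your site); the snippet edits a demo copy of the file so it is self-contained.

```shell
# Sketch: ensure SlurmdDebug=debug5 is set, adding the line if it is missing.
# We operate on a demo copy here; on a real node CONF would be the actual
# slurm.conf (e.g. /etc/slurm-llnl/slurm.conf in the OP's setup).
CONF="slurm.conf.demo"
printf 'GresTypes=gpu\nSlurmdDebug=info\n' > "$CONF"   # stand-in file contents

if grep -q '^SlurmdDebug=' "$CONF"; then
  sed -i 's/^SlurmdDebug=.*/SlurmdDebug=debug5/' "$CONF"
else
  echo 'SlurmdDebug=debug5' >> "$CONF"
fi

# On the real GPU node you would then restart slurmd and watch its log:
#   systemctl restart slurmd
#   tail -f /var/log/slurm/slurmd.log   # assumed path; see SlurmdLogFile
```

Remember to drop the debug level again afterwards; debug5 is very chatty.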

Cheers,
Stephan

On 03.05.22 20:51, Jim Kavitsky wrote:
>
> Whoops. Sent the first to an incorrect address... apologies if this
> shows up as a duplicate.
>
> -jimk
>
> *From: *Jim Kavitsky <JimKavitsky at lucidmotors.com>
> *Date: *Tuesday, May 3, 2022 at 11:46 AM
> *To: *slurm-users at schedmd.com <slurm-users at schedmd.com>
> *Subject: *gres/gpu count lower than reported
>
> Hello Fellow Slurm Admins,
>
> I have a new Slurm installation that was working and running basic 
> test jobs until I added gpu support. My worker nodes are now all in 
> drain state, with gres/gpu count reported lower than configured (0 < 4)
>
> This is in spite of the fact that nvidia-smi reports all four A100s
> as active on each node. I have spent a good chunk of a week googling
> for a solution and trying variants of the gpu config lines and
> restarting daemons, without any luck.
>
> The relevant lines from my current config files are below. The head 
> node and all workers have the same gres.conf and slurm.conf files. Can 
> anyone suggest anything else I should be looking at or adding? I’m 
> guessing that this is a problem that many have faced, and any guidance 
> would be greatly appreciated.
>
> root at sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
>
> GresTypes=gpu
>
> NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
>
> root at sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
>
> NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]
>
> root at sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"
>
>             NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES  REASON
>      sjc01enadsapp01       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp02       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp03       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp04       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp05       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp06       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp07       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>      sjc01enadsapp08       0/0/64/64      drain    2063731        Primary*           gpu:4  gres/gpu count reported lower than configured (0 < 4)
>
> root at sjc01enadsapp07:~# nvidia-smi
>
> Tue May  3 18:41:34 2022
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |
> | N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
> | N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
> | N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |
> | N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> +-----------------------------------------------------------------------------+

--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich

Phone +41 44 632 30 59
stephan.roth at ee.ethz.ch
www.isg.ee.ethz.ch

Working days: Mon,Tue,Thu,Fri

