[slurm-users] FW: gres/gpu count lower than reported
Stephan Roth
stephan.roth at ee.ethz.ch
Tue May 3 19:36:29 UTC 2022
Hi Jim,
I don't know if it makes a difference, but I only ever use the complete
numeric suffix within brackets, as in
sjc01enadsapp[01-08]
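So, just to illustrate the bracket syntax I mean (not tested against your setup, taken from the config lines you posted below), the node definitions in slurm.conf and gres.conf would read:

NodeName=sjc01enadsapp[01-08] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
NodeName=sjc01enadsapp[01-08] Name=gpu File=/dev/nvidia[0-3]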
Otherwise I'd raise the debug level of slurmd to maximum by setting
SlurmdDebug=debug5
in slurm.conf, tail the SlurmdLogFile on a GPU node and then restart
slurmd there.
This might shed some light on what goes wrong.
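Roughly this sequence (the log path and service name are my assumptions,
adjust to your installation):

# 1. in slurm.conf (same file on all nodes), raise the slurmd debug level
SlurmdDebug=debug5

# 2. on a GPU node, e.g. sjc01enadsapp01, watch the slurmd log while restarting
#    (/var/log/slurmd.log is only a guess; use whatever SlurmdLogFile points to)
tail -f /var/log/slurmd.log &
systemctl restart slurmd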
Cheers,
Stephan
On 03.05.22 20:51, Jim Kavitsky wrote:
>
> Whoops, I sent the first one to an incorrect address. Apologies if this
> shows up as a duplicate.
>
> -jimk
>
> From: Jim Kavitsky <JimKavitsky at lucidmotors.com>
> Date: Tuesday, May 3, 2022 at 11:46 AM
> To: slurm-users at schedmd.com <slurm-users at schedmd.com>
> Subject: gres/gpu count lower than reported
>
> Hello Fellow Slurm Admins,
>
> I have a new Slurm installation that was working and running basic
> test jobs until I added GPU support. My worker nodes are now all in
> drain state, with the gres/gpu count reported lower than configured (0 < 4).
>
> This is in spite of the fact that nvidia-smi reports all four A100s
> as active on each node. I have spent a good chunk of a week googling
> for a solution, trying variants of the GPU config lines and restarting
> the daemons, without any luck.
>
> The relevant lines from my current config files are below. The head
> node and all workers have the same gres.conf and slurm.conf files. Can
> anyone suggest anything else I should be looking at or adding? I’m
> guessing that this is a problem that many have faced, and any guidance
> would be greatly appreciated.
>
> root at sjc01enadsapp00:/etc/slurm-llnl# grep gpu slurm.conf
> GresTypes=gpu
> NodeName=sjc01enadsapp0[1-8] RealMemory=2063731 Sockets=2 CoresPerSocket=16 ThreadsPerCore=2 Gres=gpu:4 State=UNKNOWN
>
> root at sjc01enadsapp00:/etc/slurm-llnl# cat gres.conf
> NodeName=sjc01enadsapp0[1-8] Name=gpu File=/dev/nvidia[0-3]
>
> root at sjc01enadsapp00:~# sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.75E"
>
> NODELIST          CPUS(A/I/O/T)  STATE  MEMORY   PARTITION  GRES   REASON
> sjc01enadsapp01   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp02   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp03   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp04   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp05   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp06   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp07   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
> sjc01enadsapp08   0/0/64/64      drain  2063731  Primary*   gpu:4  gres/gpu count reported lower than configured (0 < 4)
>
> root at sjc01enadsapp07:~# nvidia-smi
> Tue May  3 18:41:34 2022
> +-----------------------------------------------------------------------------+
> | NVIDIA-SMI 470.103.01    Driver Version: 470.103.01    CUDA Version: 11.4   |
> |-------------------------------+----------------------+----------------------+
> | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
> | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
> |                               |                      |               MIG M. |
> |===============================+======================+======================|
> |   0  NVIDIA A100-PCI...  On   | 00000000:17:00.0 Off |                    0 |
> | N/A   42C    P0    49W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   1  NVIDIA A100-PCI...  On   | 00000000:65:00.0 Off |                    0 |
> | N/A   41C    P0    48W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   2  NVIDIA A100-PCI...  On   | 00000000:CA:00.0 Off |                    0 |
> | N/A   35C    P0    44W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
> |   3  NVIDIA A100-PCI...  On   | 00000000:E3:00.0 Off |                    0 |
> | N/A   38C    P0    45W / 250W |      4MiB / 40536MiB |      0%      Default |
> |                               |                      |             Disabled |
> +-------------------------------+----------------------+----------------------+
>
> +-----------------------------------------------------------------------------+
> | Processes:                                                                  |
> |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
> |        ID   ID                                                   Usage      |
> |=============================================================================|
> |    0   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    1   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    2   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> |    3   N/A  N/A      2179      G   /usr/lib/xorg/Xorg                  4MiB |
> +-----------------------------------------------------------------------------+
>
--
ETH Zurich
Stephan Roth
Systems Administrator
IT Support Group (ISG)
D-ITET
ETF D 104
Sternwartstrasse 7
8092 Zurich
Phone +41 44 632 30 59
stephan.roth at ee.ethz.ch
www.isg.ee.ethz.ch
Working days: Mon,Tue,Thu,Fri