If you run scontrol -d show node, it will show in more detail which resources are actually being used:
[root@holy8a24507 general]# scontrol show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101
Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442
Owner=N/A MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56
SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
[root@holy8a24507 general]# scontrol -d show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
GresDrain=N/A
GresUsed=gpu:nvidia_h100_80gb_hbm3:4(IDX:0-3)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101
Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442
Owner=N/A MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56
SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
Now, it won't give you the individual performance of the GPUs; Slurm doesn't currently track that in a convenient way like it does CPU load. It will at least give you what has been allocated on the node. We take the non-detailed dump (as it shows how many GPUs are allocated but not which ones) and throw it into Grafana via Prometheus to get general cluster stats: https://github.com/fasrc/prometheus-slurm-exporter
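If you just want a quick command-line summary rather than the exporter, something like the following sketch works against the detailed dump (this assumes GNU awk and the GresUsed=gpu:<type>:<count>(IDX:...) format shown above; adjust the pattern to your GRES names):

# Hedged sketch, not the exporter's logic: list allocated GPU indices per node.
scontrol -d show node | awk '
  /^NodeName=/   { split($1, a, "="); node = a[2] }   # start of a node record
  /GresUsed=gpu/ { print node, $1 }                   # e.g. holygpu8a11101 GresUsed=gpu:...:4(IDX:0-3)
'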
If you are looking for performance stats, NVIDIA has a DCGM exporter that we use to pull them and dump them into Grafana: https://github.com/NVIDIA/dcgm-exporter
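As a rough illustration of how it runs (the image tag below is a placeholder; check the dcgm-exporter README for the current one), the exporter sits on each GPU node and exposes per-GPU metrics on port 9400 for Prometheus to scrape:

# Placeholder tag; see the dcgm-exporter README for the current image.
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
# Per-GPU utilization then shows up as Prometheus metrics:
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL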
On a per-job basis I know people use Weights & Biases, but that is code-specific: https://wandb.ai/site/ You can also use scontrol -d show job to print out the layout of a job, including which specific GPUs were assigned.
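As a hedged sketch of that last approach (assuming your detailed job output includes GRES=...(IDX:...) lines, as ours does), you can loop over running jobs and pull each job's per-node GPU indices:

# Hedged sketch: print the detailed GPU layout for every running job.
# Adjust the grep if your Slurm version formats the IDX lines differently.
for j in $(squeue -h -t R -o %A); do
    echo "== job $j"
    scontrol -d show job "$j" | grep 'IDX:'
done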
-Paul Edmon-
Hello all,
Apologies for the basic question, but is there a straightforward, best-accepted method for using Slurm to report on which GPUs are currently in use? I've done some searching and people recommend all sorts of methods, including parsing the output of nvidia-smi (seems inefficient, especially across multiple GPU nodes), as well as using other tools such as Grafana, XDMoD, etc.
We do track GPUs as a resource, so I'd expect I could get at the info with sreport or something like that, but before trying to craft my own from scratch, I'm hoping someone has something working already. Ultimately I'd like to see either which cards are available by node, or the reverse (which are in use by node). I know recent versions of Slurm supposedly added tighter integration in some way with NVIDIA cards, but I can't seem to find definitive docs on what, exactly, changed or what is now possible as a result.
Warmest regards,
Jason

--
Jason L. Simms, Ph.D., M.P.H.
Research Computing Manager
Swarthmore College
Information Technology Services
(610) 328-8102