[slurm-users] GPU utilization of running jobs

Vecerka Daniel vecerka at fel.cvut.cz
Wed Oct 19 08:30:45 UTC 2022


  We want to push our users to run jobs with high GPU utilization. 
Because it is difficult for users to find out the GPU utilization of 
their jobs, I have decided to write a script that prints the utilization 
of running jobs. The idea is simple:

  1. Get the list of running jobs in the GPU partitions.
  2. Get the IDs of the GPUs allocated to each job from step 1 (scontrol 
show job=$job_id -d).
  3. Query the Prometheus API for the utilization of the allocated GPU(s) 
from step 2 over the period in which the job has been running. 
This requires https://github.com/NVIDIA/dcgm-exporter.
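
Steps 2 and 3 could be sketched roughly as follows. This is only an illustration, not the actual script: it assumes that the detailed scontrol output contains lines of the form "Nodes=... GRES=gpu:v100:2(IDX:0-1)", that the exporter publishes the metric DCGM_FI_DEV_GPU_UTIL with "Hostname" and "gpu" labels, and that Prometheus is queried through its /api/v1/query endpoint. The function names are made up for the example.

```python
import re
from urllib.parse import urlencode

# Assumed shape of a detail line in `scontrol show job=<id> -d` output, e.g.
#   Nodes=node01 CPU_IDs=0-7 Mem=0 GRES=gpu:v100:2(IDX:0-1)
# Hostlist expressions such as node[01-02] are returned unexpanded.
DETAIL_RE = re.compile(r"\bNodes=(\S+).*?GRES=\S*?\(IDX:([0-9,\-]+)\)")


def expand_indices(spec):
    """Expand an index list such as '0-1,3' into [0, 1, 3]."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def allocated_gpus(scontrol_output):
    """Return {node: [GPU ids]} parsed from the scontrol text (step 2)."""
    return {m.group(1): expand_indices(m.group(2))
            for m in DETAIL_RE.finditer(scontrol_output)}


def utilization_query(node, gpu_ids, seconds):
    """Build a PromQL expression averaging GPU utilization over `seconds`.

    The metric and label names are assumptions about the exporter setup.
    """
    ids = "|".join(str(i) for i in gpu_ids)
    return ('avg_over_time(DCGM_FI_DEV_GPU_UTIL{Hostname="%s",gpu=~"%s"}[%ds])'
            % (node, ids, seconds))


def query_url(base_url, query, end_time):
    """Build the URL for an instant query evaluated at `end_time` (step 3)."""
    params = urlencode({"query": query, "time": end_time})
    return base_url.rstrip("/") + "/api/v1/query?" + params
```

The parser takes the scontrol text as a string, so it can be tried on saved output without a running cluster. The resulting URL can then be fetched with any HTTP client.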

   It works fine for our Intel nodes with 4 V100 GPUs, but on our AMD 
nodes with 4 or 8 A100 GPUs there is a problem: the IDs of allocated 
GPUs printed by scontrol show job=$job_id -d don't correspond to the IDs 
used by the NVIDIA DCGM Exporter and nvidia-smi, i.e. by the NVIDIA NVML 
library. The Slurm -> NVML mapping is 1->0, 2->3, 3->2 (so presumably 
0->1) on 4-GPU nodes and 0->2, 1->3, 2->0, 3->1, 4->6, 5->7, 6->4, 7->5 
on 8-GPU nodes.
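
The per-node-type conversion can be written as a small lookup table. The sketch below only encodes the mappings observed above; the 0->1 entry for 4-GPU nodes is inferred from the fact that the mapping has to be a permutation, and the names SLURM_TO_NVML and to_nvml_ids are made up for the example.

```python
# Slurm GPU index -> NVML index, as observed on the AMD/A100 nodes.
# The 0 -> 1 entry for 4-GPU nodes is inferred, not stated explicitly.
SLURM_TO_NVML = {
    4: {0: 1, 1: 0, 2: 3, 3: 2},
    8: {0: 2, 1: 3, 2: 0, 3: 1, 4: 6, 5: 7, 6: 4, 7: 5},
}


def to_nvml_ids(slurm_ids, gpus_per_node, remap=True):
    """Translate Slurm GPU indices to NVML indices for one node.

    Pass remap=False for nodes where both numberings agree
    (the Intel/V100 nodes in this setup).
    """
    if not remap:
        return sorted(slurm_ids)
    table = SLURM_TO_NVML[gpus_per_node]
    return sorted(table[i] for i in slurm_ids)
```

For example, to_nvml_ids([0, 1], 8) gives [2, 3], which are the IDs to use in the DCGM Exporter query for a job that Slurm reports as holding GPUs 0 and 1 on an 8-GPU node.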

We are using Slurm 20.11.7, and gres.conf on the Intel nodes is:

Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=gpu Type=v100 File=/dev/nvidia2
Name=gpu Type=v100 File=/dev/nvidia3

On the AMD nodes:

Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1
Name=gpu Type=a100 File=/dev/nvidia2
Name=gpu Type=a100 File=/dev/nvidia3

It is no problem to hack the script to convert the IDs for the AMD 
nodes, so the script works fine for all our nodes, but I would like to 
publish it on GitLab and make it as universal as possible. My question 
is: do you know why Slurm sometimes uses different GPU IDs than the 
NVIDIA NVML library?

Another question: do you know how to store the IDs of the used GPUs in 
Slurmdb, so that we can get the GPU utilization of completed jobs?

We have in slurm.conf

and the only information stored is the number of allocated GPUs.

Thanks in advance,  Daniel Vecerka, CTU in Prague