[slurm-users] GPU utilization of running jobs
Vecerka Daniel
vecerka at fel.cvut.cz
Wed Oct 19 08:30:45 UTC 2022
Hi,
we want to push our users to run jobs with high GPU utilization.
Because it is difficult for users to get the GPU utilization of their
jobs, I have decided to write a script that prints the utilization of
running jobs.
The idea is simple:
1. Get the list of running jobs in the GPU partitions.
2. Get the IDs of the GPUs allocated to each job from step 1
(scontrol show job=$job_id -d).
3. Query the Prometheus API for the utilization of the allocated
GPU(s) from step 2 over the period in which the job has been running.
https://github.com/NVIDIA/dcgm-exporter is needed.
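Steps 2 and 3 could look roughly like the sketch below. It is only an
illustration under some assumptions: that the per-node line printed by
scontrol show job=$job_id -d has the form
"Nodes=<node> ... GRES=gpu:<type>:<n>(IDX:<list>)" with a single node
name, and that dcgm-exporter exposes the metric DCGM_FI_DEV_GPU_UTIL
with the labels Hostname and gpu. The function names and the Prometheus
base URL are hypothetical. The sketch only builds the query URL and does
not contact Prometheus.

```python
import re
from urllib.parse import urlencode


def expand_indices(spec):
    """Expand an index list such as "0-1,3" into [0, 1, 3]."""
    indices = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            indices.extend(range(int(first), int(last) + 1))
        else:
            indices.append(int(part))
    return indices


# Assumed shape of the per-node line in `scontrol show job=<id> -d` output.
# \b keeps "NumNodes=" on the job summary line from matching.
NODE_LINE = re.compile(r"\bNodes=(\S+)\s.*GRES=\S*\(IDX:([\d,-]+)\)")


def allocated_gpus(scontrol_output):
    """Return {node name: [Slurm GPU indices]} for one job."""
    gpus = {}
    for line in scontrol_output.splitlines():
        match = NODE_LINE.search(line)
        if match:
            gpus[match.group(1)] = expand_indices(match.group(2))
    return gpus


def utilization_query_url(base_url, node, gpu_ids, start, end, step="60s"):
    """Build a Prometheus /api/v1/query_range URL for the given GPUs.

    The metric and label names are assumptions about the dcgm-exporter
    setup; adjust them to match the local configuration.
    """
    ids = "|".join(str(i) for i in gpu_ids)
    query = f'DCGM_FI_DEV_GPU_UTIL{{Hostname="{node}",gpu=~"{ids}"}}'
    params = {"query": query, "start": start, "end": end, "step": step}
    return f"{base_url}/api/v1/query_range?" + urlencode(params)
```

The start and end arguments would be the job start time and the current
time as Unix timestamps. Multi-node jobs whose node names are printed as
a range (for example "node[01-02]") are not handled by this sketch.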
It works fine for our Intel nodes with 4 V100 GPUs, but on our AMD
nodes with 4 or 8 A100 GPUs there is a problem: the IDs of the
allocated GPUs printed by scontrol show job=$job_id -d don't correspond
to the IDs used by NVIDIA DCGM Exporter and nvidia-smi, i.e. by the
NVIDIA NVML library.
For example, GPU ID 1 in Slurm is ID 0 for NVML. The mapping
(Slurm -> NVML) is 1->0, 2->3, 3->2 on 4-GPU nodes (and hence 0->1),
and 0->2, 1->3, 2->0, 3->1, 4->6, 5->7, 6->4, 7->5 on 8-GPU nodes.
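The conversion I currently hack into the script amounts to a lookup
table like the sketch below. The tables only encode what we observed on
our AMD nodes and are not a general rule; the names SLURM_TO_NVML and
to_nvml_ids are just illustrative, and the 0 -> 1 entry for 4-GPU nodes
is inferred from the other three pairs.

```python
# Slurm GPU index -> NVML GPU index, as observed on our AMD nodes only.
SLURM_TO_NVML = {
    # 4-GPU nodes: 0 -> 1 is inferred, the other pairs were observed.
    4: {0: 1, 1: 0, 2: 3, 3: 2},
    # 8-GPU nodes: all pairs observed.
    8: {0: 2, 1: 3, 2: 0, 3: 1, 4: 6, 5: 7, 6: 4, 7: 5},
}


def to_nvml_ids(slurm_ids, gpus_per_node, remap=True):
    """Translate Slurm GPU indices to NVML indices for one node.

    With remap=False (e.g. on our Intel nodes, where the IDs already
    agree) the indices are returned unchanged.
    """
    if not remap:
        return list(slurm_ids)
    table = SLURM_TO_NVML[gpus_per_node]
    return [table[i] for i in slurm_ids]
```

For example, to_nvml_ids([0, 1], 8) gives the NVML indices to use in the
Prometheus query for a job that Slurm reports as using GPUs 0 and 1 on
an 8-GPU node.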
We are using Slurm 20.11.7, and gres.conf on the Intel nodes is:
AutoDetect=nvml
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=gpu Type=v100 File=/dev/nvidia2
Name=gpu Type=v100 File=/dev/nvidia3
On the AMD nodes it is:
AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1
Name=gpu Type=a100 File=/dev/nvidia2
Name=gpu Type=a100 File=/dev/nvidia3
It is not a problem to hack the script to convert the IDs for the AMD
nodes, so the script works fine for all our nodes, but I would like to
publish it on GitLab and make it as universal as possible. My question
is: do you know why Slurm sometimes uses different GPU IDs than the
NVIDIA NVML library?
Another question is: do you know how to store the IDs of the used GPUs
in Slurmdb, so that we can get the GPU utilization of completed jobs?
We have in slurm.conf
AccountingStorageTRES=cpu,mem,gres/gpu
and the only information that is stored is the number of allocated GPUs.
Thanks in advance, Daniel Vecerka, CTU in Prague