<font size="4"><font face="Helvetica, Arial, sans-serif">Hi,<br>
<br>
we want to push our users to run jobs with high GPU
utilization. Because it's difficult for users to get GPU
utilization of their jobs, I have decided to write script, which
prints utilization of running jobs. The idea is simple: <br>
<br>
1. get the list of running jobs in the GPU partitions
2. get the IDs of the allocated GPUs for each job from step 1 (scontrol show job=$job_id -d)
3. get, via the Prometheus API, the utilization of the allocated GPU(s) from step 2 over the period when the job is running
https://github.com/NVIDIA/dcgm-exporter is needed. A rough sketch of these steps is below.

It works fine for our Intel nodes with 4 V100 GPUs, but on our AMD nodes with 4 or 8 A100 GPUs there is a problem: the IDs of the allocated GPUs printed by scontrol show job=$job_id -d do not correspond to the IDs used by the NVIDIA DCGM Exporter and nvidia-smi, i.e. by the NVIDIA NVML library.
GPU ID 1 in Slurm is ID 0 for NVML; the mapping is 1 -> 0, 2 -> 3, 3 -> 2 on the 4-GPU nodes, and 0 -> 2, 1 -> 3, 2 -> 0, 3 -> 1, 4 -> 6, 5 -> 7, 6 -> 4, 7 -> 5 on the 8-GPU nodes.
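
To illustrate, converting the IDs amounts to a hard-coded remap table along these lines (a hypothetical sketch built only from the mappings above; the 4-GPU entry for Slurm ID 0 is assumed to be the inverse of the observed 1 -> 0, and the V100 nodes need no remapping):

# Hypothetical remap tables, keyed by (GPU type, GPUs per node), built only from
# the Slurm -> NVML mappings observed above. The a100/4-GPU entry for Slurm ID 0
# is an assumption (inverse of the observed 1 -> 0).
SLURM_TO_NVML = {
    ("a100", 4): {0: 1, 1: 0, 2: 3, 3: 2},
    ("a100", 8): {0: 2, 1: 3, 2: 0, 3: 1, 4: 6, 5: 7, 6: 4, 7: 5},
}


def to_nvml_id(gpu_type, gpus_per_node, slurm_id):
    # Identity for anything not listed (e.g. the V100 nodes, where the IDs match).
    return SLURM_TO_NVML.get((gpu_type, gpus_per_node), {}).get(slurm_id, slurm_id)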

We are using Slurm 20.11.7, and gres.conf on the Intel nodes is:

AutoDetect=nvml
Name=gpu Type=v100 File=/dev/nvidia0
Name=gpu Type=v100 File=/dev/nvidia1
Name=gpu Type=v100 File=/dev/nvidia2
Name=gpu Type=v100 File=/dev/nvidia3

On the AMD nodes:

AutoDetect=nvml
Name=gpu Type=a100 File=/dev/nvidia0
Name=gpu Type=a100 File=/dev/nvidia1
Name=gpu Type=a100 File=/dev/nvidia2
Name=gpu Type=a100 File=/dev/nvidia3

It is not a problem to hack the script to convert the IDs for the AMD nodes so that it works fine on all our nodes, but I would like to publish the script on GitLab and make it as universal as possible. My question is: do you know why Slurm sometimes uses different GPU IDs than the NVIDIA NVML library?

Another question: do you know how to store the IDs of the used GPUs in Slurmdb, so that we can get the GPU utilization of completed jobs?

We have in slurm.conf:

AccountingStorageTRES=cpu,mem,gres/gpu

and the only information that is stored is the number of allocated GPUs.

Thanks in advance, Daniel Vecerka, CTU in Prague