Interesting solution, I didn't know it was possible to do this. I'll try to test this as well!
Sylvain
On 17/10/2024 10:45, Pierre-Antoine Schnell via slurm-users wrote:
Hello,
we recently started monitoring GPU usage on our cluster with NVIDIA's DCGM: https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-mana...
We create a new dcgmi group for each job and start statistics collection for it in a prolog script.
In an epilog script we then stop the collection, save the verbose output of dcgmi stats, and delete the dcgmi group.
The output presents JobID, GPU IDs, runtime, energy consumed, and SM utilization, among other things.
We load the relevant data into a database and hope to be able to advise our users on better practices based on its analysis.
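For readers who want to try this, the prolog/epilog workflow described above might look roughly like the sketch below. The state directory, group-naming scheme, output location, and the handoff file used to pass the group id from prolog to epilog are my assumptions, not details from the post; the dcgmi subcommands follow NVIDIA's job-statistics workflow, but verify them against your DCGM version.

```shell
#!/bin/bash
# Hedged sketch of per-job DCGM stats via Slurm prolog/epilog.
# STATE_DIR, the group name, and the .group handoff file are assumptions.

STATE_DIR="/var/run/dcgm-jobs"   # assumed location for per-job state

dcgm_prolog() {
    mkdir -p "${STATE_DIR}"
    # Create a DCGM group holding this job's GPUs; dcgmi prints the new
    # group id, which we parse from the last word of that line.
    local group_id
    group_id=$(dcgmi group -c "job_${SLURM_JOB_ID}" -a "${CUDA_VISIBLE_DEVICES}" \
                   | awk '/group ID/ {print $NF}')
    # Remember the group id so the epilog (a separate process) can find it.
    echo "${group_id}" > "${STATE_DIR}/${SLURM_JOB_ID}.group"
    # Enable stats recording on the group and start a per-job stats record.
    dcgmi stats -g "${group_id}" -e
    dcgmi stats -g "${group_id}" -s "${SLURM_JOB_ID}"
}

dcgm_epilog() {
    local group_id
    group_id=$(cat "${STATE_DIR}/${SLURM_JOB_ID}.group")
    # Stop recording and save the verbose per-job report.
    dcgmi stats -x "${SLURM_JOB_ID}"
    dcgmi stats -v -j "${SLURM_JOB_ID}" > "${STATE_DIR}/${SLURM_JOB_ID}.stats"
    # Clean up the per-job group and the handoff file.
    dcgmi group -d "${group_id}"
    rm -f "${STATE_DIR}/${SLURM_JOB_ID}.group"
}
```

The two functions would be called from the scripts configured as Prolog and Epilog in slurm.conf; the handoff file is needed because prolog and epilog run as separate processes and cannot share shell variables.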
Best wishes, Pierre-Antoine Schnell
On 16.10.24 at 15:10, Sylvain MARET via slurm-users wrote:
Hey guys!
I'm looking to improve GPU monitoring on our cluster. I want to install this https://github.com/NVIDIA/dcgm-exporter and saw in the README that it supports tracking of job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-...
However, I haven't found any examples of how to do it, nor does Slurm seem to expose this information by default. Does anyone here do this? If so, do you have any examples I could follow? If you have advice on best practices for GPU monitoring, I'd be happy to hear it!
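For what it's worth, my reading of the dcgm-exporter README is that HPC job tracking works through a mapping directory: the exporter watches a directory in which each file is named after a GPU index and contains the id of the job currently using that GPU, and a prolog/epilog pair maintains those files. The directory path below (and however the exporter is pointed at it) are assumptions to verify against your dcgm-exporter version; a minimal sketch of the Slurm side might be:

```shell
#!/bin/bash
# Hedged sketch of the Slurm side of dcgm-exporter's HPC job mapping.
# JOB_MAP_DIR is an assumption; check the README of your exporter version
# for the directory it actually expects and the flag that configures it.

JOB_MAP_DIR="/var/run/dcgm-job-map"   # assumed path

map_job_to_gpus() {
    # Prolog: write the Slurm job id into one file per allocated GPU index.
    mkdir -p "${JOB_MAP_DIR}"
    local gpu
    for gpu in ${CUDA_VISIBLE_DEVICES//,/ }; do
        echo "${SLURM_JOB_ID}" > "${JOB_MAP_DIR}/${gpu}"
    done
}

unmap_job_from_gpus() {
    # Epilog: remove the mapping files so finished jobs drop out of the metrics.
    local gpu
    for gpu in ${CUDA_VISIBLE_DEVICES//,/ }; do
        rm -f "${JOB_MAP_DIR}/${gpu}"
    done
}
```

The job id then shows up as a label on the exporter's per-GPU metrics, which is what lets you attribute utilization to individual Slurm jobs.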
Regards, Sylvain Maret
-- Pierre-Antoine Schnell
Medizinische Universität Wien IT-Dienste & Strategisches Informationsmanagement Enterprise Technology & Infrastructure High Performance Computing
1090 Wien, Spitalgasse 23 Bauteil 88, Ebene 00, Büro 611
+43 1 40160-21304
pierre-antoine.schnell@meduniwien.ac.at
-- slurm-users mailing list -- slurm-users@lists.schedmd.com To unsubscribe send an email to slurm-users-leave@lists.schedmd.com