I've started testing in the prolog, and you're right! Before doing anything I wanted to see if there were any best practices.
Regards, Sylvain Maret
On 16/10/2024 18:03, Brian Andrus via slurm-users wrote:
Looks like there is a step you would need to do to create the required job mapping files:
/The DCGM-exporter can include High-Performance Computing (HPC) job information into its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs./
It does go on to show the conventions/format of the files.
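If I'm reading that section right, you end up with a directory containing one file per GPU, named after the GPU device ID, with one job identifier per line. Roughly like this (the directory path and job id here are made up for illustration):

$ cat /var/run/dcgm-job-maps/0
12345

i.e. GPU 0 is currently running Slurm job 12345, and dcgm-exporter picks that up as a label on that GPU's metrics.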
I imagine you could have some bits in a prolog script that create those files as the job starts on the node, and point dcgm-exporter at that directory. Something like the sketch below.
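As an untested sketch, assuming a mapping directory of /var/run/dcgm-job-maps (pick whatever suits you) and relying on SLURM_JOB_GPUS, which slurmd sets in the Prolog/Epilog environment for jobs that were allocated GPUs:

#!/bin/bash
# Prolog: record this job against each GPU it was allocated.
MAPPING_DIR=/var/run/dcgm-job-maps

# SLURM_JOB_GPUS is a comma-separated list of allocated GPU device IDs;
# it is unset if the job has no GPUs, in which case there is nothing to do.
[ -n "$SLURM_JOB_GPUS" ] || exit 0

mkdir -p "$MAPPING_DIR"
for gpu in ${SLURM_JOB_GPUS//,/ }; do
    # One file per GPU device ID, one job id per line.
    echo "$SLURM_JOB_ID" >> "$MAPPING_DIR/$gpu"
done

You'd want a matching Epilog that deletes the job's line again, so stale ids don't linger once the job ends:

#!/bin/bash
# Epilog: remove this job from the mapping files of its GPUs.
MAPPING_DIR=/var/run/dcgm-job-maps
[ -n "$SLURM_JOB_GPUS" ] || exit 0
for gpu in ${SLURM_JOB_GPUS//,/ }; do
    [ -f "$MAPPING_DIR/$gpu" ] && sed -i "/^${SLURM_JOB_ID}\$/d" "$MAPPING_DIR/$gpu"
done

Then start dcgm-exporter pointed at the same directory; if I remember the README correctly that's the --hpc-job-mapping-dir option (or the DCGM_HPC_JOB_MAPPING_DIR environment variable), but double-check the flag name against the docs.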
Brian Andrus
On 10/16/24 06:10, Sylvain MARET via slurm-users wrote:
Hey guys!
I'm looking to improve GPU monitoring on our cluster. I want to install https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can support tracking of job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-...
However, I haven't been able to find any examples of how to do this, nor does Slurm seem to expose this information by default. Does anyone here do this? If so, do you have any examples I could follow? And if you have advice on best practices for GPU monitoring, I'd be happy to hear it!
Regards, Sylvain Maret