Looks like there is a step you need to take to create the required job-mapping files. From the README:

The DCGM-exporter can include High-Performance Computing (HPC) job information into its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs.

It goes on to show the naming conventions and format of those files.

I imagine you could have some bits in a Slurm prolog script that create those files as the job starts on the node (and an epilog that removes them when it ends), then point dcgm-exporter at that directory.
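
Something like this might work as a rough, untested sketch. The mapping directory (/var/run/dcgm-job-maps) is just a placeholder; it has to match whatever directory you configure dcgm-exporter to read, and the exact option and file format are in the README you linked. SLURM_JOB_GPUS is the variable slurmd exposes to Prolog/Epilog with the GPU IDs allocated to the job on that node:

#!/bin/bash
# Prolog fragment: write one file per allocated GPU ID containing the
# job ID, following what I understand the dcgm-exporter README's file
# convention to be (file named after the GPU ID, job ID inside).
MAPDIR=/var/run/dcgm-job-maps   # placeholder; must match dcgm-exporter config
mkdir -p "$MAPDIR"

# SLURM_JOB_GPUS is a comma-separated list of GPU IDs, e.g. "0,2"
IFS=',' read -ra GPUS <<< "${SLURM_JOB_GPUS:-}"
for gpu in "${GPUS[@]}"; do
    echo "$SLURM_JOB_ID" > "$MAPDIR/$gpu"
done

And a matching Epilog fragment to clean up when the job ends:

#!/bin/bash
# Epilog fragment: drop the mapping files for the GPUs this job held.
MAPDIR=/var/run/dcgm-job-maps
IFS=',' read -ra GPUS <<< "${SLURM_JOB_GPUS:-}"
for gpu in "${GPUS[@]}"; do
    rm -f "$MAPDIR/$gpu"
done

Note this simple version assumes one job per GPU; with shared GPUs you would want to append and remove individual job IDs rather than overwrite the file.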

Brian Andrus

On 10/16/24 06:10, Sylvain MARET via slurm-users wrote:
Hey guys!

I'm looking to improve GPU monitoring on our cluster. I want to install this https://github.com/NVIDIA/dcgm-exporter and saw in the README that it supports tracking of job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter

However, I haven't been able to find any examples of how to do it, nor does Slurm seem to expose this information by default.
Does anyone here do this? If so, do you have any examples I could follow? And if you have advice on best practices for GPU monitoring, I'd be happy to hear it!

Regards,
Sylvain Maret