[slurm-users] How do you guys track which GPU is used by which job ?

16 Oct 2024


Hey guys!
I'm looking to improve GPU monitoring on our cluster. I want to install 
dcgm-exporter (https://github.com/NVIDIA/dcgm-exporter) and saw in the 
README that it can track the job ID: 
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-...
However, I haven't been able to find any examples of how to set this up, 
and Slurm doesn't seem to expose this information by default.
Does anyone here do this? If so, do you have any examples I could 
try to follow? If you have advice on best practices for monitoring GPUs, 
I'd be happy to hear it!
Regards,
Sylvain Maret
