Hi all,
I’m trying to pull (and understand) historical GPU usage metrics, and have been digging into sacct’s TRES reporting. We have AccountingStorageTRES=gres/gpu set in slurm.conf, so the gres/gpuutil and gres/gpumem values are available, but I’m struggling to find Slurm-side documentation that describes their units. From the code in gpu_nvml.c, it looks like the nvmlDeviceGetProcessUtilization function is being used, which returns values as percentages, but I’m lost on the rest of the calculation.
Does anyone know whether these values are percentages, and how they are calculated for the final job record, especially for multi-GPU jobs with many processes and moving parts? For context, I’ve been looking at TRESUsageInTot and TRESUsageInAve so far. We’re currently running Slurm v23.02.6.
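For reference, here is a minimal sketch of how these two fields could be pulled with sacct and split into per-TRES values. It is only an illustration: the helper names parse_tres and gpu_usage_for_job are hypothetical, the sample string in the comment is made up, and the values are left as raw strings because their units are exactly what is in question.

```python
import subprocess


def parse_tres(tres: str) -> dict[str, str]:
    """Split a comma-separated TRES string into a {name: value} dict.

    Example input (made up for illustration):
        "cpu=00:00:10,gres/gpumem=1024M,gres/gpuutil=45"
    Values are kept as strings; no unit conversion is attempted.
    """
    result = {}
    for item in tres.split(","):
        if "=" not in item:
            continue  # skip empty fields
        name, value = item.split("=", 1)
        result[name.strip()] = value.strip()
    return result


def gpu_usage_for_job(job_id: str) -> list[dict]:
    """Return the gres/gpuutil and gres/gpumem entries for each job step."""
    out = subprocess.run(
        [
            "sacct", "-j", job_id, "--noheader", "--parsable2",
            "--format=JobID,TRESUsageInTot,TRESUsageInAve",
        ],
        check=True, capture_output=True, text=True,
    ).stdout

    rows = []
    for line in out.splitlines():
        # --parsable2 separates fields with "|" and adds no trailing delimiter
        step, tot, ave = line.split("|")
        tot_tres, ave_tres = parse_tres(tot), parse_tres(ave)
        rows.append({
            "step": step,
            "gpuutil_tot": tot_tres.get("gres/gpuutil"),
            "gpumem_tot": tot_tres.get("gres/gpumem"),
            "gpuutil_ave": ave_tres.get("gres/gpuutil"),
            "gpumem_ave": ave_tres.get("gres/gpumem"),
        })
    return rows
```

Calling gpu_usage_for_job with a job ID should give one row per job step, which makes it easier to compare the Tot and Ave values side by side for multi-GPU jobs.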
Thanks in advance!
--
Jordan Robertson
Preferred pronouns: he/him/his
Technology Architect | Research Technology Services
DigITs, Technology Division
Memorial Sloan Kettering Cancer Center
929-687-1066
robertj8@mskcc.org