Hi all,

I’m trying to pull (and understand) GPU usage metrics for historical reporting, and have been digging into sacct’s TRES reporting. We have AccountingStorageTRES=gres/gpu set in slurm.conf, so gres/gpuutil and gres/gpumem values are available, but I can’t find any Slurm-side documentation that describes their units. From the code in gpu_nvml.c, it looks like the nvmlDeviceGetProcessUtilization function is used, which reports utilization as a percentage, but I can’t follow the rest of the calculation.

Does anyone know whether these values are percentages, and how they are calculated for the final job record, especially for multi-GPU jobs with many processes and moving parts? For context, I’ve been looking at TRESUsageInTot and TRESUsageInAve so far. We’re currently running Slurm v23.02.6.
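
For reference, this is roughly how I’m separating the GPU fields out of the sacct output. It is only a minimal sketch: parse_tres is my own helper name, and the sample string is made up for illustration rather than taken from a real job. The values are deliberately left as raw strings, since their units are exactly what I’m unsure about.

```python
# Sketch: split a sacct TRES usage string into name/value pairs.
# Input is assumed to come from something like:
#   sacct -j <jobid> -P --format=JobID,TRESUsageInTot,TRESUsageInAve


def parse_tres(tres: str) -> dict[str, str]:
    """Turn 'name=value,name=value' into a dict, keeping values as strings."""
    fields = {}
    for item in tres.split(","):
        item = item.strip()
        if not item:
            continue  # skip empty entries, e.g. from an empty field
        name, _, value = item.partition("=")
        fields[name] = value
    return fields


# Hypothetical sample value, NOT real job data.
sample = "cpu=00:10:00,gres/gpumem=2048M,gres/gpuutil=85"

# Keep only the GPU-related TRES entries.
gpu_fields = {
    name: value
    for name, value in parse_tres(sample).items()
    if name.startswith("gres/gpu")
}
print(gpu_fields)
```

What I can’t tell is how to interpret the gres/gpuutil and gres/gpumem values once they are extracted like this.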

Thanks in advance!

-- 

Jordan Robertson

Preferred pronouns: he/him/his

Technology Architect | Research Technology Services

DigITs, Technology Division

Memorial Sloan Kettering Cancer Center

929-687-1066
robertj8@mskcc.org
