Hi all,
I’m trying to pull (and understand) some GPU usage metrics for historical purposes, and dug into sacct’s TRES reporting a bit. We have AccountingStorageTRES=gres/gpu set in slurm.conf, so we do see gres/gpuutil and gres/gpumem numbers available, but I’m struggling to find Slurm-side documentation that describes the units of these values. Looking at the code in gpu_nvml.c, it seems the nvmlDeviceGetProcessUtilization function is being used, which reports utilization as percentages, but I’m lost on the rest of the calculation.
Does anyone know whether these values are percentages, and how they are calculated for the final job record, especially for multi-GPU jobs with many processes and moving parts? For context, I’ve been looking at TRESUsageInTot and TRESUsageInAve so far. We’re currently running Slurm v23.02.6.
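For anyone wanting to look at the same fields, here is a minimal sketch (not Slurm's own code) of splitting a TRESUsageInTot or TRESUsageInAve string into its named entries. The sample string and its values are made up for illustration, and parse_tres is a hypothetical helper; the sketch deliberately leaves the values as raw strings, since their units are the open question.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Split a comma-separated TRES string of name=value pairs into a dict.

    Values are kept as raw strings; no unit conversion is attempted.
    """
    fields = {}
    for item in tres.split(","):
        if not item:
            continue  # skip empty pieces, e.g. from an empty field
        name, _, value = item.partition("=")
        fields[name.strip()] = value.strip()
    return fields


# Hypothetical sample in the name=value,name=value layout; the numbers are invented.
sample = "cpu=00:01:30,mem=2048K,gres/gpumem=1024M,gres/gpuutil=87"

usage = parse_tres(sample)

# Keep only the GPU-related entries being asked about.
gpu_fields = {k: v for k, v in usage.items() if k.startswith("gres/gpu")}
print(gpu_fields)
```

Keeping the values as strings avoids baking in a guess about whether gres/gpuutil is a percentage or something summed across processes or GPUs.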
Thanks in advance!
--
Jordan Robertson
Preferred pronouns: he/him/his
Technology Architect | Research Technology Services
DigITs, Technology Division
Memorial Sloan Kettering Cancer Center
929-687-1066
robertj8@mskcc.org