Hi,
Since their example was limited to "allgpus", I posted my take on this on the NVIDIA developer blog.
It is basically the same, but it looks up the group ID from the dcgmi group JSON output using jp instead of reading it from a file.
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-mana...
prolog
group=$(sudo -u $SLURM_JOB_USER dcgmi group -c j$SLURM_JOB_ID)
if [ $? -eq 0 ]; then
  groupid=$(echo $group | awk '{print $10}')
  sudo -u $SLURM_JOB_USER dcgmi group --group $groupid --add $SLURM_JOB_GPUS
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --enable
  sudo -u $SLURM_JOB_USER dcgmi stats --group $groupid --jstart $SLURM_JOBID
fi
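
The awk '{print $10}' step depends on the exact wording of the message that dcgmi prints when it creates a group. As a rough sketch only, assuming that message looks like 'Successfully created group "j123" with a group ID of 5' (the ID being the last field), a small helper could check that the extracted value is really a number. The name parse_group_id is made up for this illustration and is not part of dcgmi or Slurm.

```shell
# Hypothetical helper (not part of dcgmi or Slurm): extract the numeric group ID
# from the message printed by `dcgmi group -c`.
# Assumption: the ID is the last whitespace-separated field of the message.
parse_group_id() {
    id=$(printf '%s\n' "$1" | awk '{print $NF}')
    case "$id" in
        ''|*[!0-9]*) return 1 ;;        # empty or not a plain number: refuse to guess
        *) printf '%s\n' "$id" ;;
    esac
}
```

In the prolog this could be used as groupid=$(parse_group_id "$group") in place of the echo/awk line, so that an unexpected message does not end up passed to --group.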
epilog
OUTPUTDIR=/tmp/
sudo -u $SLURM_JOB_USER dcgmi stats --jstop $SLURM_JOBID
sudo -u $SLURM_JOB_USER dcgmi stats --verbose --job $SLURM_JOBID | sudo -u $SLURM_JOB_USER tee $OUTPUTDIR/dcgm-gpu-stats-$HOSTNAME-$SLURM_JOBID.out
groupid=$(sudo -u $SLURM_JOB_USER dcgmi group -l --json | jp "body.Groups.children.[*][0][?children.\"Group Name\".value=='j$SLURM_JOBID'].children.\"Group ID\".value | [0]" | sed 's/"//g')
sudo -u $SLURM_JOB_USER dcgmi group --delete $groupid
Kind regards