We have a node with 8 H100 GPUs that are split into MIG instances, and we are using cgroups for device confinement. This seems to work fine. Users can do something like
sbatch --gres="gpu:1g.10gb:1"...
and the job starts on the node with the GPUs. CUDA_VISIBLE_DEVICES and PyTorch's debug output show that the cgroup gives them only the GPU they asked for.
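For what it's worth, this is roughly how we verify the confinement from inside a job (the script below is just an illustration):

#!/bin/bash
#SBATCH --gres=gpu:1g.10gb:1
# Should print a single MIG device identifier, e.g. MIG-<uuid>
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Should list only the device(s) the cgroup exposes
nvidia-smi -L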
In the accounting database, the "gres_used" column in the job table is always empty. I'd expect to see "gpu:1g.10gb:1" there for the job above.
I have this set in slurm.conf
AccountingStorageTRES=gres/gpu
How can I see what gres was requested with the job? At the moment I only see something like this in AllocTRES
billing=1,cpu=1,gres/gpu=1,mem=8G,node=1
and can't see any way to tell which specific MIG GPU was requested. This is related to Richard Lefebvre's email of 7 June 2023, "Billing/accounting for MIGs is not working", which as far as I can see got no replies.
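(For reference, the AllocTRES output above comes from a query along the lines of

sacct -j <jobid> --format=JobID,AllocTRES%60

with <jobid> standing in for the actual job ID.)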
We are running Slurm version 23.11.6.
Regards,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
Emyr James via slurm-users <slurm-users@lists.schedmd.com> writes:
I have this set in slurm.conf
AccountingStorageTRES=gres/gpu
I believe you need to list all types of GPUs (including MIGs) that you have configured on the nodes, in addition to the general "gres/gpu". For instance, on one of our clusters, we have
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:rtx30,gres/gpu:1g.20gb,gres/gpu:a40
Then AllocTRES from sacct will show things like
billing=19,cpu=6,gres/gpu:a100=1,gres/gpu=1,mem=12G,node=1
depending on what the job specifies.
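One caveat, if I remember correctly: changes to AccountingStorageTRES only take effect after slurmctld is restarted, not just after an "scontrol reconfigure", so after editing slurm.conf something like

systemctl restart slurmctld

is needed before new jobs pick up the per-type entries. Once that's done, you can query both the requested and the allocated TRES, e.g. (with <jobid> as a placeholder)

sacct -j <jobid> -X --format=JobID,ReqTRES%60,AllocTRES%60

where ReqTRES shows what the job asked for and AllocTRES what it was given.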