We have a node with 8 H100 GPUs that are split into MIG instances, and we are using cgroups for device confinement. This seems to work fine. Users can do something like
sbatch --gres="gpu:1g.10gb:1"...
and the job starts on the node with the GPUs. CUDA_VISIBLE_DEVICES and PyTorch's debug output show that the cgroup gives them only the GPU they asked for.
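For what it's worth, this is roughly how we verify the confinement from inside a job (the script below is just an illustration):

#!/bin/bash
#SBATCH --gres=gpu:1g.10gb:1
# Should print a single MIG device identifier, e.g. MIG-<uuid>
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
# Should list only the device(s) the cgroup exposes
nvidia-smi -L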
In the accounting database, the "gres_used" column in the job table is always empty. I'd expect to see "gpu:1g.10gb:1" there for the job above.
I have this set in slurm.conf
AccountingStorageTRES=gres/gpu
How can I see what gres was requested with the job? At the moment I only see something like this in AllocTRES
billing=1,cpu=1,gres/gpu=1,mem=8G,node=1
and can't see any way to tell which specific MIG GPU was requested. This is related to Richard Lefebvre's email of 7 June 2023, "Billing/accounting for MIGs is not working", which as far as I can see got no replies.
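(For reference, the AllocTRES output above comes from a query along the lines of

sacct -j <jobid> --format=JobID,AllocTRES%60

with <jobid> standing in for the actual job ID.)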
We are running Slurm version 23.11.6.
Regards,
Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation
Emyr James via slurm-users <slurm-users@lists.schedmd.com> writes:
I have this set in slurm.conf
AccountingStorageTRES=gres/gpu
I believe you need to list all types of GPUs (including MIGs) that you have configured on the nodes, in addition to the general "gres/gpu". For instance, on one of our clusters, we have
AccountingStorageTRES=gres/gpu,gres/gpu:a100,gres/gpu:rtx30,gres/gpu:1g.20gb,gres/gpu:a40
Then AllocTRES from sacct will show things like
billing=19,cpu=6,gres/gpu:a100=1,gres/gpu=1,mem=12G,node=1
depending on what the job specifies.
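One caveat, if I remember correctly: changes to AccountingStorageTRES only take effect after slurmctld is restarted, not just after an "scontrol reconfigure", so after editing slurm.conf something like

systemctl restart slurmctld

is needed before new jobs pick up the per-type entries. Once that's done, you can query both the requested and the allocated TRES, e.g. (with <jobid> as a placeholder)

sacct -j <jobid> -X --format=JobID,ReqTRES%60,AllocTRES%60

where ReqTRES shows what the job asked for and AllocTRES what it was given.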