Hi,
At my HPC center, I found a SLURM job that was submitted with --gres=gpu:6, whereas the cluster has only four GPUs per node. It is a parallel job. Here is a printout of some relevant fields:
AllocCPUS    30
AllocGRES    gpu:6
AllocTRES    billing=30,cpu=30,gres/gpu=6,node=3
CPUTime      1-01:23:00
CPUTimeRAW   91380
Elapsed      00:50:46
JobID        20073
JobIDRaw     20073
JobName      simple_cuda
NCPUS        30
NGPUS        6.0
What happened in this case? The job asked for 3 nodes with 10 cores per node. When the user specified “--gres=gpu:6”, does this mean six GPUs for the entire job, or six GPUs per node? Per the description in https://slurm.schedmd.com/gres.html#Running_Jobs, gres is “Generic resources required per node”. But requesting six GPUs per node makes no sense when each node has only four. So what happened? Did SLURM quietly ignore the request and grant just one GPU, or did it grant the maximum available (four)? Apparently the job ran without error.
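For reference, I imagine the batch script looked roughly like the sketch below; the exact directives are my guess (the CPU request may well have been expressed differently, e.g. as --ntasks=30):

#SBATCH --nodes=3
#SBATCH --ntasks-per-node=10
#SBATCH --gres=gpu:6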
Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529
Hi Wirawan,
In general, “--gres=gpu:6” actually means six units of a generic resource named “gpu” per node. Each unit may or may not be associated with a physical GPU device.
I'd check the node configuration to see how many gres/gpu units are configured for that node:
scontrol show node <node>
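and look at the Gres= and CfgTRES= lines in the output, for example (the exact field names can vary slightly between Slurm versions):

scontrol show node <node> | grep -E 'Gres|CfgTRES'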
Maybe your GPU devices are multi-instance GPUs (MIG), with each physical GPU split into multiple separate GPU instances, so that every gres=gpu unit counts against the total number of MIG instances rather than the number of physical GPU devices on the node?
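If you can log in to one of those nodes, a quick way to check (assuming NVIDIA GPUs) would be:

nvidia-smi -L

MIG instances show up as separate "MIG ..." entries listed under each physical GPU.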
Best regards,
Jürgen