[slurm-users] What happens if GPU GRES exceeding number of GPUs per node
Purwanto, Wirawan
wpurwant at odu.edu
Wed Jan 17 15:54:05 UTC 2024
Hi,
In my HPC center, I found a SLURM job that was submitted with --gres=gpu:6, whereas the cluster has only four GPUs per node. It is a parallel job. Here is a printout of some relevant accounting fields (a sketch of the query I believe produced them follows the printout):
AllocCPUS 30
AllocGRES gpu:6
AllocTRES billing=30,cpu=30,gres/gpu=6,node=3
CPUTime 1-01:23:00
CPUTimeRAW 91380
Elapsed 00:50:46
JobID 20073
JobIDRaw 20073
JobName simple_cuda
NCPUS 30
NGPUS 6.0
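As far as I can tell these fields come from sacct, possibly via a local wrapper script; something like the command below should reproduce most of them (NGPUS is not a standard sacct field, so that one is presumably computed by the wrapper):

    sacct -j 20073 --format=JobID,JobIDRaw,JobName,AllocCPUS,AllocTRES,NCPUS,CPUTime,CPUTimeRAW,Elapsed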
What happened in this case? This job asked for 3 nodes with 10 cores per node. When the user specified “--gres=gpu:6”, does that mean six GPUs for the entire job, or six GPUs per node? Per the description at https://slurm.schedmd.com/gres.html#Running_Jobs, GRES are “Generic resources required per node”, so requesting six GPUs per node would be illogical on four-GPU nodes. What happened, then? Did SLURM quietly ignore the request and grant just one GPU, or grant the maximum (4)? Apparently the job ran without error.
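For what it is worth, I do not have the user's submission script, so the options below are only my reconstruction from the accounting fields (10 cores per node could equally have been expressed with --cpus-per-task):

    #!/bin/bash
    #SBATCH --job-name=simple_cuda
    #SBATCH --nodes=3                 # matches node=3 in AllocTRES
    #SBATCH --ntasks-per-node=10      # 10 cores per node -> 30 CPUs, matching AllocCPUS=30
    #SBATCH --gres=gpu:6              # the request in question; gres.html says GRES counts are per node
    srun ./simple_cuda

While a job is still running, "scontrol show job -d JOBID" should show the GPU indices actually assigned on each node, which would answer the per-node vs. per-job question directly; for this completed job, all I have is the AllocTRES total above.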
Wirawan Purwanto
Computational Scientist, HPC Group
Information Technology Services
Old Dominion University
Norfolk, VA 23529