[slurm-users] New node w/ 3 GPUs is not accepting GPUs tasks
David Henkemeyer
david.henkemeyer at gmail.com
Wed Jun 23 18:51:11 UTC 2021
Hello,
I just added a third node (called "hsw5") to my Slurm partition as we
continue to enable Slurm in our environment, but the new node is not
accepting jobs that require a GPU, despite the fact that it has 3 GPUs.
The other node that has a GPU ("devops3") is accepting GPU jobs as
expected. A colleague pointed out an interesting difference (under the
GRES column) when we ran this command:
(! 676)-> sinfo -o "%20N %10c %10m %25f %20G "
NODELIST             CPUS       MEMORY     AVAIL_FEATURES            GRES
devops2              4          9913       avx,centos,fast,fma,fma4, (null)
devops3              8          40213      centos,cuda10.1p,cuda10.2 gpu:1(S:0-1)
hsw5                 64         257847    foo,bar                    gpu:3
Is there a problem with the GPU bindings on "hsw5"? Do GPUs need to be
associated with sockets, or something like that?
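
For what it's worth, my understanding is that the (S:0-1) suffix sinfo
prints for devops3 comes from a Cores= entry in that node's gres.conf,
which gives Slurm the GPU-to-socket affinity. A sketch of what I'd
expect the two config files to contain (hypothetical device paths and
core ranges, not our actual files):

    # slurm.conf: the node's Gres= count must match what gres.conf declares
    NodeName=hsw5 CPUs=64 RealMemory=257847 Gres=gpu:3 Feature=foo,bar

    # gres.conf on hsw5: Cores= is the core affinity that sinfo then
    # reports as the (S:x-y) suffix in the GRES column
    NodeName=hsw5 Name=gpu File=/dev/nvidia[0-2] Cores=0-31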
Here is the error message I'm seeing:
(! 681)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo
--gpus=1 --wrap "ls"
sbatch: error: Batch job submission failed: Requested node configuration is
not available
(! 682)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo
--wrap "ls"
Submitted batch job 385
Thanks for the help,
David