[slurm-users] New node w/ 3 GPUs is not accepting GPUs tasks

David Henkemeyer david.henkemeyer at gmail.com
Wed Jun 23 18:51:11 UTC 2021


Hello,

I just added a third node ("hsw5") to my Slurm partition as we
continue to enable Slurm in our environment.  But the new node is not
accepting jobs that require a GPU, even though it has 3 GPUs.

The other node that has a GPU ("devops3") is accepting GPU jobs as
expected.  A colleague pointed out an interesting difference (under the
GRES column) when we did this command:

(! 676)-> sinfo -o "%20N  %10c  %10m  %25f  %20G "
NODELIST              CPUS        MEMORY      AVAIL_FEATURES             GRES
devops2               4           9913        avx,centos,fast,fma,fma4,  (null)
devops3               8           40213       centos,cuda10.1p,cuda10.2  gpu:1(S:0-1)
hsw5                  64          257847      foo,bar                    gpu:3

Is there a problem with the GPU bindings on "hsw5"?  Do GPUs need to be
associated with sockets, or something like that?
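
For context, my understanding is that the "(S:0-1)" suffix on devops3's
GRES entry reflects socket/core affinity declared in that node's
gres.conf (via Cores=), while hsw5's plain "gpu:3" suggests no such
binding is configured there.  A minimal gres.conf sketch for a 3-GPU
node might look like the following -- the device paths and core ranges
here are illustrative assumptions, not hsw5's actual layout:

    # gres.conf on hsw5 (sketch only; verify paths and core ranges
    # against the real hardware, e.g. with nvidia-smi topo -m)
    Name=gpu File=/dev/nvidia0 Cores=0-31
    Name=gpu File=/dev/nvidia1 Cores=0-31
    Name=gpu File=/dev/nvidia2 Cores=32-63

with a matching "Gres=gpu:3" on hsw5's NodeName line in slurm.conf.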

Here is the error message I'm seeing:

(! 681)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo --gpus=1 --wrap "ls"
sbatch: error: Batch job submission failed: Requested node configuration is not available


(! 682)-> /opt/slurm-20.11.5/bin/sbatch --export=NONE -N 1 --constraint foo --wrap "ls"
Submitted batch job 385


Thanks for the help,

David