[slurm-users] GPU machines only run a single GPU job despite resources being available.
Benjamin Wong
bwong at keiserlab.org
Tue Jul 16 22:10:56 UTC 2019
Hi everyone,
I have a Slurm node named mk-gpu-1 with eight GPUs, which I've been using to test
GPU-based container jobs. For whatever reason, it will only run a single GPU job
at a time. All other GPU jobs submitted through Slurm sit in the pending (PD)
state with the reason "(Resources)".
[ztang@mk-gpu-1 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               523     gpu.q slurm-gp    ztang PD       0:00      1 (Resources)
               522     gpu.q slurm-gp   bwong1  R       0:09      1 mk-gpu-1
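For reference, this is roughly how I check what the controller thinks is
configured versus allocated on the node (output omitted here; the grep is just
to pull out the Gres/TRES lines):

[ztang@mk-gpu-1 ~]$ scontrol show node mk-gpu-1 | grep -Ei 'gres|tres'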
Does anyone know why this is happening? Here are the relevant portions of my
configuration:
slurm.conf:
GresTypes=gpu
AccountingStorageTres=gres/gpu
DebugFlags=CPU_Bind,gres
NodeName=mk-gpu-1 NodeAddr=10.10.100.106 RealMemory=500000 Gres=gpu:8 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=gpu.q Nodes=mk-gpu-1,mk-gpu-2,mk-gpu-3 Default=NO MaxTime=INFINITE State=UP
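(For what it's worth, that NodeName line works out to 2 sockets x 12 cores x 2
threads = 48 logical CPUs plus gres/gpu:8. One way to confirm what slurmctld
actually registered for the node is something like the sinfo line below; the
format string is just one possibility for printing node name, CPU count, and
GRES.)

[ztang@mk-gpu-1 ~]$ sinfo -N -n mk-gpu-1 -o "%N %c %G"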
gres.conf:
# This line is causing issues in Slurm 19.05
#AutoDetect=nvml
NodeName=mk-gpu-1 Name=gpu File=/dev/nvidia[0-7]
(I commented out AutoDetect=nvml because otherwise slurmd will not start and
logs: "slurmd[28070]: fatal: We were configured to autodetect nvml
functionality, but we weren't able to find that lib when Slurm was
configured." I could use some help with that too, if possible.)
cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
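(With ConstrainDevices=yes I would expect a single-GPU job on this node to see
only the one device it was allocated; a quick check along these lines should
list exactly one GPU if the cgroup side is behaving:)

[ztang@mk-gpu-1 ~]$ srun -p gpu.q -w mk-gpu-1 --gres=gpu:1 nvidia-smi -L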
submission script:
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
srun singularity exec --nv docker://tensorflow/tensorflow:latest-gpu \
python ./models/tutorials/image/mnist/convolutional.py
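(The jobs in the squeue output above were submitted with essentially this same
script, e.g. something like the following, where the filename is just a
placeholder for the script above:)

[ztang@mk-gpu-1 ~]$ sbatch slurm-gpu-job.sh
[ztang@mk-gpu-1 ~]$ sbatch slurm-gpu-job.sh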
Thanks in advance for any ideas,
Benjamin Wong