[slurm-users] GPU machines only run a single GPU job despite resources being available.
Benjamin Wong
bwong at keiserlab.org
Tue Jul 16 22:10:56 UTC 2019
Hi everyone,
I have a Slurm node named mk-gpu-1 with eight GPUs, which I've been using to test
GPU-based container jobs. For whatever reason, it will only run a single GPU job
at a time. All other GPU jobs submitted through Slurm sit in the pending (PD)
state with the reason "(Resources)".
[ztang@mk-gpu-1 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               523     gpu.q slurm-gp    ztang PD       0:00      1 (Resources)
               522     gpu.q slurm-gp   bwong1  R       0:09      1 mk-gpu-1
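For reference, this is roughly how I check what the controller thinks is
configured versus allocated on the node (output omitted here; the grep is just
to pull out the Gres/TRES lines):

[ztang@mk-gpu-1 ~]$ scontrol show node mk-gpu-1 | grep -Ei 'gres|tres'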
Does anyone know why this is happening? Here are the relevant portions of my
configuration:
slurm.conf:
GresTypes=gpu
AccountingStorageTres=gres/gpu
DebugFlags=CPU_Bind,gres
NodeName=mk-gpu-1 NodeAddr=10.10.100.106 RealMemory=500000 Gres=gpu:8 Sockets=2 CoresPerSocket=12 ThreadsPerCore=2 State=UNKNOWN
PartitionName=gpu.q Nodes=mk-gpu-1,mk-gpu-2,mk-gpu-3 Default=NO MaxTime=INFINITE State=UP
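(For what it's worth, that NodeName line works out to 2 sockets x 12 cores x 2
threads = 48 logical CPUs plus gres/gpu:8. One way to confirm what slurmctld
actually registered for the node is something like the sinfo line below; the
format string is just one possibility for printing node name, CPU count, and
GRES.)

[ztang@mk-gpu-1 ~]$ sinfo -N -n mk-gpu-1 -o "%N %c %G"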
gres.conf:
# This line is causing issues in Slurm 19.05
#AutoDetect=nvml
NodeName=mk-gpu-1 Name=gpu File=/dev/nvidia[0-7]
(I commented out AutoDetect=nvml because otherwise slurmd will not start and
logs: "slurmd[28070]: fatal: We were configured to autodetect nvml
functionality, but we weren't able to find that lib when Slurm was
configured." I could use some help with that too, if possible.)
cgroup.conf:
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
ConstrainDevices=yes
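(With ConstrainDevices=yes I would expect a single-GPU job on this node to see
only the one device it was allocated; a quick check along these lines should
list exactly one GPU if the cgroup side is behaving:)

[ztang@mk-gpu-1 ~]$ srun -p gpu.q -w mk-gpu-1 --gres=gpu:1 nvidia-smi -L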
submission script:
#!/bin/bash
#SBATCH -c 2
#SBATCH -o slurm-gpu-job.out
#SBATCH -p gpu.q
#SBATCH -w mk-gpu-1
#SBATCH --gres=gpu:1
srun singularity exec --nv docker://tensorflow/tensorflow:latest-gpu \
python ./models/tutorials/image/mnist/convolutional.py
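(The jobs in the squeue output above were submitted with essentially this same
script, e.g. something like the following, where the filename is just a
placeholder for the script above:)

[ztang@mk-gpu-1 ~]$ sbatch slurm-gpu-job.sh
[ztang@mk-gpu-1 ~]$ sbatch slurm-gpu-job.sh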
Thanks in advance for any ideas,
Benjamin Wong