[slurm-users] gres:gpu management
Daniel Vecerka
vecerka at fel.cvut.cz
Thu May 23 07:31:47 UTC 2019
Hello,
We are running Slurm 18.08.6 and have problems with GRES GPU management.
There is a "gpu" partition with 12 nodes, each with 4 Tesla V100 cards.
GPU allocation works, and GPU management for sbatch/srun jobs works
too: CUDA_VISIBLE_DEVICES is set correctly according to the
--gres=gpu:x option. But we have problems with GPU management for job
steps. If I try this example:
#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch -p gpu --gres=gpu:4 -n4 gres_test.bash
#
echo JOB $SLURM_JOB_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
cat show_device.sh
#!/bin/bash
echo JOB $SLURM_JOB_ID STEP $SLURM_STEP_ID CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES
I get:
JOB 49614 CUDA_VISIBLE_DEVICES=0,1,2,3
JOB 49614 STEP 0 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 1 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 2 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 3 CUDA_VISIBLE_DEVICES=0
But according to https://slurm.schedmd.com/gres.html I would expect:
JOB 49614 CUDA_VISIBLE_DEVICES=0,1,2,3
JOB 49614 STEP 0 CUDA_VISIBLE_DEVICES=0
JOB 49614 STEP 1 CUDA_VISIBLE_DEVICES=1
JOB 49614 STEP 2 CUDA_VISIBLE_DEVICES=2
JOB 49614 STEP 3 CUDA_VISIBLE_DEVICES=3
So we are not able to distribute job steps to different GPUs inside an
sbatch job. We can use a wrapper like this:
#!/bin/bash
export CUDA_VISIBLE_DEVICES=$SLURM_STEPID
my_job
but a built-in Slurm solution would be better and more robust.
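For illustration only, here is a slightly more defensive sketch of such a wrapper. Instead of using the step ID directly as a device number, it picks an entry from the job's own GPU list, so it would still behave sensibly if the job were allocated fewer than 4 GPUs. The function name pick_gpu and the variable JOB_GPUS are hypothetical: the batch script is assumed to run `export JOB_GPUS=$CUDA_VISIBLE_DEVICES` before the srun lines, and srun is assumed to pass that variable on to the steps. It also assumes the step is not prevented (for example by device cgroup constraints) from accessing the other GPUs of the job. This is a workaround sketch, not how Slurm itself assigns devices.

```bash
#!/bin/bash
# gpu_wrapper.sh (hypothetical name): run a command on one GPU of the job.

# pick_gpu JOB_GPUS STEP_ID
# Prints the entry of the comma-separated list JOB_GPUS selected by STEP_ID,
# wrapping around when there are more steps than GPUs.
pick_gpu() {
    local -a gpus
    IFS=',' read -r -a gpus <<< "$1"
    local count=${#gpus[@]}
    if (( count == 0 )); then
        return 1    # empty list: nothing to choose from
    fi
    echo "${gpus[$(( $2 % count ))]}"
}

# Only act inside a job step and only if the batch script exported JOB_GPUS
# (an assumed variable name, not something Slurm sets).
if [[ -n "${SLURM_STEP_ID:-}" && -n "${JOB_GPUS:-}" ]]; then
    export CUDA_VISIBLE_DEVICES=$(pick_gpu "$JOB_GPUS" "$SLURM_STEP_ID")
    exec "$@"
fi
```

Each step would then be started as `srun --gres=gpu:1 -n1 --exclusive gpu_wrapper.sh my_job`.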
GRES section of slurm.conf
AccountingStorageTRES=gres/gpu
JobAcctGatherType=jobacct_gather/cgroup
GresTypes=gpu
NodeName=n[21-32] Gres=gpu:v100:4 Sockets=2 CoresPerSocket=18 ThreadsPerCore=2 RealMemory=384000 TmpDisk=150000 State=UNKNOWN Weight=1000
PartitionName=gpu Nodes=n[21-32] Default=NO MaxTime=24:00:00 State=UP Priority=5 PriorityTier=15 OverSubscribe=FORCE
/etc/slurm/gres.conf
Name=gpu Type=v100 File=/dev/nvidia0 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia1 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia2 CPUs=18-35,54-71
Name=gpu Type=v100 File=/dev/nvidia3 CPUs=18-35,54-71
Any help appreciated.
Thanks, Daniel Vecerka CTU Prague