[slurm-users] multiple srun commands in the same SLURM script

Andrei Berceanu andreicberceanu at gmail.com
Tue Oct 31 10:50:57 UTC 2023


Here is my SLURM script:

#!/bin/bash

#SBATCH --job-name="gpu_test"
#SBATCH --output=gpu_test_%j.log       # Standard output and error log
#SBATCH --account=berceanu_a+

#SBATCH --partition=gpu
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=31200m           # Reserve ~31 GB of RAM per core
#SBATCH --time=12:00:00                # Max allowed job runtime
#SBATCH --gres=gpu:16                  # Allocate all 16 GPUs on the node

export SLURM_EXACT=1

# Launch 4 background job steps, each requesting a single GPU
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 python gpu_test.py &

wait

What I expect this to do is to run, in parallel, 4 independent copies
of the gpu_test.py Python script, using 4 out of the 16 GPUs on this
node.

What it actually does is run the script on a single GPU only; it's as
if the other 3 srun commands do nothing. Perhaps they do not see any
available GPUs for some reason?
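
If it helps, I could swap the Python script for a bare visibility
check inside the same allocation; a minimal sketch, assuming
nvidia-smi is installed on thor, with everything else as in the
script above and only two of the four steps shown:

# each step prints its step ID and the GPUs it can see
srun --mpi=pmi2 -n 1 --gpus-per-node 1 \
    bash -c 'echo "step $SLURM_STEP_ID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L' &
srun --mpi=pmi2 -n 1 --gpus-per-node 1 \
    bash -c 'echo "step $SLURM_STEP_ID: CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L' &
wait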

System info:

slurm 19.05.2

Linux 5.4.0-90-generic #101~18.04.1-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu          up   infinite      1   idle thor

NodeName=thor Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.45
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:16(S:0-1)
   NodeAddr=thor NodeHostName=thor
   OS=Linux 5.4.0-90-generic #101~18.04.1-Ubuntu SMP Fri Oct 22 09:25:04 UTC 2021
   RealMemory=1546812 AllocMem=0 FreeMem=1433049 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=gpu
   BootTime=2023-08-09T14:58:01 SlurmdStartTime=2023-08-09T14:58:36
   CfgTRES=cpu=48,mem=1546812M,billing=48,gres/gpu=16
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

I can add any additional system info as required.
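
For example, while the job is running I could post the step and GPU
layout reported by commands along these lines (with <jobid> filled in):

squeue -s -j <jobid>            # list the individual job steps and their states
scontrol -d show job <jobid>    # detailed per-node layout, including the GPU indices assigned
sacct -j <jobid> --format=JobID,AllocTRES%40,State   # per-step TRES from accounting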

Thank you so much for taking the time to read this,

Regards,
Andrei


