[slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node
Guillaume De Nayer
denayer at hsu-hh.de
Wed Jun 15 12:20:56 UTC 2022
Dear all,
I'm new on this list. I am responsible for several small clusters at our
chair.
I set up slurm 21.08.8-2 on a small cluster (CentOS 7) with 8 nodes:
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1
One colleague has to run 20,000 jobs on this machine. Each job starts
its program with mpirun on 12 cores. With the standard Slurm behavior,
the node running such a job is blocked entirely (leaving 28 cores idle).
Since the cluster has only 8 nodes, only 8 jobs can run in parallel.
In order to solve this problem I'm trying to start some subtasks with
srun inside a batch job (without mpirun for now):
#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --time=02:00:00
#SBATCH --exclusive
srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait
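In case it is relevant: I also wondered whether memory, not cores, is what serializes the steps. With SelectTypeParameters=CR_Core_Memory, a step that makes no explicit memory request might be granted the job's entire memory allocation, so the second step would have to wait. A variant with a per-step memory request (the 1G value is an arbitrary placeholder, not something from my config) would look like:

```shell
#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --nodes=1
#SBATCH --partition=short
#SBATCH --time=02:00:00
#SBATCH --exclusive

# Give each step an explicit memory slice so the first step cannot
# claim the whole job allocation and block the second one.
srun -vvv --exact -n1 -c1 --mem-per-cpu=1G sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 --mem-per-cpu=1G sleep 30 > srun2.log 2>&1 &
wait
```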
However, only one task runs at a time: the second waits for the first
to complete before starting.
Can someone explain to me what I'm doing wrong?
Thx in advance,
Regards,
Guillaume
# slurm.conf file
MpiDefault=none
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmUser=root
SwitchType=switch/none
TaskPlugin=task/none
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
AccountingStorageEnforce=limits
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_comment
JobAcctGatherFrequency=30
SlurmctldDebug=error
SlurmdDebug=error
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdLogFile=/var/log/slurmd.log
NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=1 State=UNKNOWN
PartitionName=short Nodes=node[01-08] Default=NO MaxTime=0-02:00:00 State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100