[slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node

Tina Friedrich tina.friedrich at it.ox.ac.uk
Wed Jun 15 12:48:24 UTC 2022


Hi Guillaume,

in that example you wouldn't need the 'srun' calls to run more than one
task, I think.
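
Something along these lines should then be enough - a rough, untested
sketch, where './solver' just stands in for your colleague's program:

#!/bin/bash
#SBATCH --job-name=solver_12c      # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks=12                # ask for 12 cores, not a whole node
#SBATCH --partition=short
#SBATCH --time=02:00:00
# note: no --exclusive, so the other 28 cores of the node stay
# available to further jobs (once cores are scheduled - see below)

mpirun -np 12 ./solver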


I'm not 100% sure, but to me it sounds like you're currently assigning
whole nodes to jobs rather than cores (i.e. you have
'SelectType=select/linear' and no OverSubscribe) and find that to be
wasteful - is that correct?

If it is, I'd say the more obvious solution to that would be to change 
the SelectType to either select/cons_res or select/cons_tres, so that 
cores (not nodes) are allocated to jobs?
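
For example, something like this in slurm.conf (a minimal sketch -
CR_Core_Memory and the DefMemPerCPU value are only illustrative, pick
whatever matches your nodes):

SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
# with memory as a consumable resource, jobs need a default
# memory-per-core; 4000 MB here is purely an example value
DefMemPerCPU=4000

With that, a 12-core job occupies just 12 cores, and the scheduler can
pack three such jobs onto one of your 40-core nodes.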

Tina

On 15/06/2022 13:20, Guillaume De Nayer wrote:
> Dear all,
> 
> I'm new to this list. I am responsible for several small clusters at our
> chair.
> 
> I set up Slurm 21.08.8-2 on a small cluster (CentOS 7) with 8 nodes:
> NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=1
> 
> One colleague has to run 20,000 jobs on this machine. Every job starts
> his program with mpirun on 12 cores. The standard Slurm behavior blocks
> the whole node that runs such a job (leaving 28 cores idle). The small
> cluster has only 8 nodes, so only 8 jobs can run in parallel.
> 
> In order to solve this problem I'm trying to start some subtasks with
> srun inside a batch job (without mpirun for now):
> 
> #!/bin/bash
> #SBATCH --job-name=test_multi_prog_srun
> #SBATCH --nodes=1
> #SBATCH --partition=short
> #SBATCH --time=02:00:00
> #SBATCH --exclusive
> 
> srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
> srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
> wait
> 
> 
> However, only one task runs. The second waits for the first to complete
> before starting.
> 
> Can someone explain to me what I'm doing wrong?
> 
> 
> Thx in advance,
> Regards,
> Guillaume
> 
> 
> # slurm.conf file
> MpiDefault=none
> ProctrackType=proctrack/linuxproc
> ReturnToService=1
> SlurmUser=root
> SwitchType=switch/none
> TaskPlugin=task/none
> SchedulerType=sched/backfill
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> AccountingStorageEnforce=limits
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStoreFlags=job_comment
> JobAcctGatherFrequency=30
> SlurmctldDebug=error
> SlurmdDebug=error
> SlurmctldLogFile=/var/log/slurmctld.log
> SlurmdLogFile=/var/log/slurmd.log
> 
> NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
> ThreadsPerCore=1 State=UNKNOWN
> PartitionName=short   Nodes=node[01-08] Default=NO  MaxTime=0-02:00:00
> State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100

-- 
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator

Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk


