[slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node

Guillaume De Nayer denayer at hsu-hh.de
Wed Jun 15 13:08:37 UTC 2022


On 06/15/2022 02:48 PM, Tina Friedrich wrote:
> Hi Guillaume,
> 

Hi Tina,

> in that example you wouldn't need the 'srun' to run more than one task,
> I think.
> 

You are correct. To start a program like sleep I could simply run:
sleep 20s &
sleep 30s &
wait

However, my objective is to use mpirun in combination with srun to avoid
having to define a rankfile manually.
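For illustration, a sketch of the intended job step (assuming an MPI
library built with Slurm/PMI support, so that srun can take the place of
mpirun; the binary name ./my_mpi_app is a placeholder):

```shell
#!/bin/bash
# Start a 12-rank MPI job step in the background; srun places the ranks
# on the cores Slurm allocated, so no hand-written rankfile is needed.
srun --exact -n12 -c1 ./my_mpi_app > app.log 2>&1 &
wait
```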

> 
> I'm not 100% sure, but to me it sounds like you're currently assigning
> whole nodes to jobs rather than cores (i.e have
> 'SelectType=select/linear' and no OverSubscribe) and find that to be
> wasteful - is that correct?
> 

In my first email I copied parts of my slurm.conf. I'm using
"SelectType=select/cons_res"

with

"SelectTypeParameters=CR_Core_Memory"

And until now no OverSubscribe. I tried activating
"OverSubscribe=YES" on the partition with

PartitionName=short   Nodes=node[01-08] Default=NO  MaxTime=0-02:00:00
State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100 OverSubscribe=YES

But it did not solve the issue with

srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
wait


> If it is, I'd say the more obvious solution to that would be to change
> the SelectType to either select/cons_res or select/cons_tres, so that
> cores (not nodes) are allocated to jobs?
> 

How can I be sure that Slurm is actually using the "select/cons_res"
setting defined in my /etc/slurm/slurm.conf?
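One way to check which select plugin the controller actually loaded
(scontrol queries the running slurmctld, not the file on disk):

```shell
# Print the select plugin and its parameters as seen by slurmctld.
scontrol show config | grep -i "^Select"
```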

Thx a lot
Guillaume


> Tina
> 
> On 15/06/2022 13:20, Guillaume De Nayer wrote:
>> Dear all,
>>
>> I'm new on this list. I am responsible for several small clusters at our
>> chair.
>>
>> I set up slurm 21.08.8-2 on a small cluster (CentOS 7) with 8 nodes:
>> NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=1
>>
>> One colleague has to run 20,000 jobs on this machine. Every job starts
>> its program with mpirun on 12 cores. The standard Slurm behavior blocks
>> the whole node running such a job (leaving 28 cores idle). Since the
>> small cluster has only 8 nodes, only 8 jobs can run in parallel.
>>
>> In order to solve this problem I'm trying to start some subtasks with
>> srun inside a batch job (without mpirun for now):
>>
>> #!/bin/bash
>> #SBATCH --job-name=test_multi_prog_srun
>> #SBATCH --nodes=1
>> #SBATCH --partition=short
>> #SBATCH --time=02:00:00
>> #SBATCH --exclusive
>>
>> srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
>> srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
>> wait
>>
>>
>> However, only one task runs. The second waits for the first task to
>> complete before starting.
>>
>> Can someone explain to me what I'm doing wrong?
>>
>>
>> Thx in advance,
>> Regards,
>> Guillaume
>>
>>
>> # slurm.conf file
>> MpiDefault=none
>> ProctrackType=proctrack/linuxproc
>> ReturnToService=1
>> SlurmUser=root
>> SwitchType=switch/none
>> TaskPlugin=task/none
>> SchedulerType=sched/backfill
>> SelectType=select/cons_res
>> SelectTypeParameters=CR_Core_Memory
>> AccountingStorageEnforce=limits
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStoreFlags=job_comment
>> JobAcctGatherFrequency=30
>> SlurmctldDebug=error
>> SlurmdDebug=error
>> SlurmctldLogFile=/var/log/slurmctld.log
>> SlurmdLogFile=/var/log/slurmd.log
>>
>> NodeName=node0[1-8] CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=20
>> ThreadsPerCore=1 State=UNKNOWN
>> PartitionName=short   Nodes=node[01-08] Default=NO  MaxTime=0-02:00:00
>> State=UP DefaultTime=00:00:00 MinNodes=1 PriorityTier=100
>>
>>
>>
> 

