[slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node

Guillaume De Nayer denayer at hsu-hh.de
Thu Jun 16 07:24:09 UTC 2022


Hi Gareth,

I think you solved the problem. My slurm.conf had no memory settings at
all (neither in the node definition nor in the partition). I changed
that and also added "--mem-per-cpu 1" to the srun calls. It seems to
work now. I will test it with mpirun next.
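
For reference, a minimal sketch of the two-step test job quoted further
down, with the memory request added to each srun. It assumes slurm.conf
now defines memory for the node and the partition. The value 1 (MB) is
only enough for "sleep"; a real program would need a realistic figure.

```shell
#!/bin/bash
#SBATCH --job-name=test_multi_prog_srun
#SBATCH --ntasks=2
#SBATCH --partition=short
#SBATCH --time=02:00:00

# Assumption: slurm.conf defines memory for the node and partition, so
# each step can ask for a small share instead of all of the job's memory.
# Each step requests one task, one core and 1 MB per CPU, which leaves
# room for the second step to start while the first is still running.
srun --exact -n1 -c1 --mem-per-cpu=1 sleep 20 > srun1.log 2>&1 &
srun --exact -n1 -c1 --mem-per-cpu=1 sleep 30 > srun2.log 2>&1 &

# Block until both background steps have finished.
wait
```

While the job runs, sacct should then list both steps as "RUNNING" at
the same time.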

Thx a lot for your help!
Regards
Guillaume



On 06/15/2022 11:20 PM, Williams, Gareth (IM&T, Black Mountain) wrote:
> I think the problem might be that you are not requesting memory, so by default, all memory on a node is allocated to the job and "cons_res" will not allocate a second job to any node. That comes up quite often.
> 
> Gareth
> 
> -----Original Message-----
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Guillaume De Nayer
> Sent: Thursday, 16 June 2022 1:37 AM
> To: slurm-users at lists.schedmd.com
> Subject: Re: [slurm-users] Multiple Program Runs using srun in one Slurm batch Job on one node
> 
> On 06/15/2022 05:25 PM, Ward Poelmans wrote:
>> Hi Guillaume,
>>
>> On 15/06/2022 16:59, Guillaume De Nayer wrote:
>>>
>>> Perhaps I misunderstand the Slurm documentation...
>>>
>>> I thought that the --exclusive option used in combination with
>>> sbatch will reserve the whole node (40 cores) for the job (submitted 
>>> with sbatch). This part is working fine. I can check it with sacct.
>>>
>>> Then, this job starts subtasks on the reserved 40 cores with srun.
>>> Therefore I'm using "-n1 -c1" in combination with "srun". I thought 
>>> that it was possible to use the reserved cores inside this job using srun.
>>
>> You're correct. --exclusive will give you all cores on the nodes but 
>> only as much memory as requested.
>>
>>  
>>> The following slightly modified job without --exclusive and with
>>> --ntasks=2 leads to a similar problem: Only one srun is running at a 
>>> time. The second starts directly after the first one finished.
>>>
>>> #!/bin/bash
>>> #SBATCH --job-name=test_multi_prog_srun
>>> #SBATCH --ntasks=2
>>> #SBATCH --partition=short
>>> #SBATCH --time=02:00:00
>>>
>>> srun -vvv --exact -n1 -c1 sleep 20 > srun1.log 2>&1 &
>>> srun -vvv --exact -n1 -c1 sleep 30 > srun2.log 2>&1 &
>>> wait
>>
>> This should work... It works on our cluster. Are you sure they don't 
>> run in parallel?
>>
> 
> Yes, I'm pretty sure that they do not run in parallel: sacct shows only one subtask as "RUNNING". Then, when this subtask is marked as "COMPLETED", the second one appears and is marked "RUNNING".
> 
> Moreover, if I connect directly to the node, only one "sleep" process
> is running.
> 
> OK. If it works on your cluster, I perhaps have a problem in my Slurm config. Which version of Slurm are you using on your cluster? And can you share your slurm.conf?
> 
>> We usually recommend using GNU parallel or xargs, like:
>>
>> xargs -P $SLURM_NTASKS srun -N 1 -n 1 -c 1 --exact sleep 30
>>
> 
> ok. I will install "gnu parallel" and also test your xargs command.
> 
> Thx a lot!
> Guillaume
> 
> 




