[slurm-users] how can users start their worker daemons using srun?

Priedhorsky, Reid reidpr at lanl.gov
Fri Aug 31 10:33:39 MDT 2018


> On Aug 28, 2018, at 6:13 PM, Christopher Samuel <chris at csamuel.org> wrote:
> 
> On 29/08/18 09:10, Priedhorsky, Reid wrote:
> 
>> This is surprising to me, as my interpretation is that the first run
>> should allocate only one CPU, leaving 35 for the second srun, which
>> also only needs one CPU and need not wait.
>> Is this behavior expected? Am I missing something?
> 
> That's odd - and I can reproduce what you see here with Slurm 17.11.7!
> 
> However, on an older system I have access to where I know this technique
> is used with 16.05.8 it does work.
> 
> My test script is:
> 
> ---------------8< snip snip 8<---------------
> #!/bin/bash
> #SBATCH -n2
> #SBATCH -c2
> #SBATCH --mem-per-cpu=2g
> 
> srun -n1 --mem-per-cpu=500m sleep 5 &
> srun -n1 --mem-per-cpu=1g hostname
> ---------------8< snip snip 8<---------------

Adding in a memory request seems to work (Bash job control chatter removed):

  $ srun -n1 -c1 --mem=1K sh -c './bar.py && sleep 30' &
  $ srun -n1 -c1 --mem=1K hostname
  cn001.localdomain
  $

hostname runs immediately, and I don’t get the warning about waiting anymore.

bar.py is another test script; it forks one child per CPU, and each child allocates 128 MiB of memory and then busy-loops for about 20 seconds. I confirmed with top that it really is running on all 36 CPUs.
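(The real bar.py isn't reproduced here, but a minimal sketch of what it does, assuming one forked child per CPU, each holding about 128 MiB and spinning for about 20 seconds, would look roughly like this:)

  #!/usr/bin/env python3
  # Rough sketch of bar.py's behavior, not the actual script.
  import os
  import time

  def burn():
      buf = bytearray(128 * 1024 * 1024)   # allocate and hold ~128 MiB
      deadline = time.time() + 20
      while time.time() < deadline:        # busy-loop for about 20 seconds
          pass

  children = []
  for _ in range(os.cpu_count()):          # one child per CPU on the node
      pid = os.fork()
      if pid == 0:
          burn()
          os._exit(0)
      children.append(pid)

  for pid in children:                     # reap the children
      os.waitpid(pid, 0)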

That is, the step exceeds both the CPU count (1) and the memory (1 KiB) that I told Slurm it would use. This is what I want. Is allowing such overuse a common configuration? I don't want to rely on quirks of our site.
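(For what it's worth, my understanding, which I haven't verified against our site's configuration, is that whether these requests are actually enforced depends on the cgroup task plugin, i.e. something like:)

  # slurm.conf
  TaskPlugin=task/cgroup

  # cgroup.conf
  ConstrainCores=yes
  ConstrainRAMSpace=yes

If settings like those were in effect, I'd expect the step to be confined to the one CPU and the requested memory rather than spreading across all 36 CPUs, so presumably our site doesn't constrain steps this way.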

The drawback here is that for real daemons, I'll need “sleep infinity” rather than a fixed sleep, and then I'll have to kill the srun manually, so this is still a workaround. The ideal behavior would be for Slurm to clean up processes not when the job step completes, but at the end of the job.

Thanks,
Reid
