[slurm-users] Interactive jobs using "srun --pty bash" and MPI

Juergen Salk juergen.salk at uni-ulm.de
Wed Nov 2 23:45:01 UTC 2022

Hi Em,

This is most probably because, as of Slurm version 20.11, the behaviour of srun
was changed so that job steps are no longer allowed to overlap by default.

An interactive job launched by "srun --pty bash" always creates a regular
step (step <jobid>.0) that occupies the allocated resources, so mpirun or srun
will hang when trying to launch another job step from within this interactive
job step, because the two steps would overlap.

You could try using the --overlap flag of srun or "export SLURM_OVERLAP=1"
before running your interactive job to revert to the previous behaviour
that allows steps to overlap.
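
As a rough sketch of those two options, reusing the resource options from
the original question: the function names below are made up for illustration,
and the commands are wrapped in functions only so that sourcing the snippet
does not start a job by itself. Whether mpirun then launches correctly may
also depend on how your MPI library was built.

```shell
# Option 1: pass --overlap to the srun that starts the interactive shell.
# (start_shell_with_flag is a hypothetical helper name.)
start_shell_with_flag() {
    srun --overlap -N 2 -n 4 --mem=4gb --pty bash
}

# Option 2: export the environment variable first, then launch as before.
# The variable is inherited by the interactive shell and by any srun
# started from within it.
export SLURM_OVERLAP=1
start_shell_with_env() {
    srun -N 2 -n 4 --mem=4gb --pty bash
}
```

Call either function on the login node, then run "mpirun -n 4 ~/prime-mpi"
at the prompt that appears.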

However, instead of using "srun --pty bash" for launching interactive jobs, it
is now recommended to use "salloc" and have "LaunchParameters=use_interactive_step"
set in slurm.conf.

"salloc" with "LaunchParameters=use_interactive_step" enabled will
create a special interactive step (step <jobid>.interactive) that does not
consume any resources and, thus, does not interfere with a new job step
launched from within this special interactive job step.
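
To check whether your cluster already has this enabled, something along
these lines might help. This is only a sketch: has_interactive_step is a
made-up helper name, and it assumes that "scontrol show config" prints a
"LaunchParameters = ..." line, which may look slightly different on your
version.

```shell
# Succeeds if the configuration text on stdin has use_interactive_step
# in its LaunchParameters line. Intended input: `scontrol show config`.
# (has_interactive_step is a hypothetical helper name.)
has_interactive_step() {
    grep -i '^LaunchParameters' | grep -q 'use_interactive_step'
}

# Typical use on a login node (needs a working Slurm installation):
#   scontrol show config | has_interactive_step && salloc -N 2 -n 4 --mem=4gb
#   mpirun -n 4 ~/prime-mpi    # run from the shell that salloc opens
```

If the check fails, the setting has to be added to slurm.conf by the
cluster administrators.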

Hope this helps.

Best regards

* Em Dragowsky <dragowsky at case.edu> [221102 15:46]:
> Greetings --
> When we started using Slurm some years ago, obtaining the interactive
> resources through "srun ... --pty bash" was the standard that we adopted.
> We are now running Slurm v22.05 (happily), though we noticed recently some
> limitations when claiming resources to demonstrate or develop in an mpi
> environment.  A colleague today was revisiting a finding dating back to
> January, which is:
> > I am having issues running interactive MPI jobs in a traditional way. It
> > just stays there without execution.
> >
> > srun -N 2 -n 4 --mem=4gb --pty bash
> > mpirun -n 4 ~/prime-mpi
> >
> > However, it does run with:
> > srun -N 2 -n 4 --mem=4gb  ~/prime-mpi
> >
> As indicated, the first approach, taking the resources to test/demo MPI
> jobs via "srun ...  --pty bash" no longer supports the launching of the
> job.  We also checked the srun environment using verbosity, and found that
> the job steps are executed and terminate before the prompt is achieved in
> the requested shell.
> While we infer that changes were implemented, would someone be able to
> direct us to documentation or a discussion as to the changes, and the
> motivation?  We do not doubt that there is compelling motivation, we ask to
> improve our understanding.  As was summarized and shared amongst our
> team following our review of the current operational behaviour:
> >
> >    - "srun ... executable" works fine
> >    - "salloc -n4", "ssh <node>", "srun -n4 <executable>" works;
> >      using "mpirun -n4 <executable>" does not work
> >    - In batch mode, both mpirun and srun work.
> >
> >
> Thanks to any and all who take the time to shed light on this matter.
> -- 
> E.M. (Em) Dragowsky, Ph.D.
> Research Computing -- UTech
> Case Western Reserve University
> (216) 368-0082
> they/them
