[slurm-users] Interactive jobs using "srun --pty bash" and MPI

Em Dragowsky dragowsky at case.edu
Thu Nov 3 17:02:28 UTC 2022


Hi, Juergen --

This is really useful information -- thanks for the pointer, and for taking
the time to share!

And, Jacob -- can you point us to any primary documentation based on
Juergen's observation that the change took place with v20.11?

With the emphasis on salloc, I find in the examples:

>        To get an allocation, and open a new xterm in which srun commands
> may be typed interactively:
>
>               $ salloc -N16 xterm
>               salloc: Granted job allocation 65537
>

which works as advertised (I'm not sure whether I miss xterms or not -- at
least on our cluster we don't configure them explicitly as a primary
terminal tool).

And thanks also Chris and Jason for the validation and endorsement of these
approaches.

Best, all!
~ Em

On Wed, Nov 2, 2022 at 5:47 PM Juergen Salk <juergen.salk at uni-ulm.de> wrote:

> Hi Em,
>
> this is most probably because, in Slurm version 20.11, the behaviour of
> srun was changed so that job steps are no longer allowed to overlap by
> default.
>
> An interactive job launched by `srun --pty bash´ always creates a regular
> step (step <jobid>.0), so mpirun or srun will hang when trying to launch
> another job step from within this interactive job step, as the steps
> would overlap.
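>
> For illustration (a sketch, assuming `squeue -s´ is available on your
> version to list running job steps):
>
>     $ srun -N 2 -n 4 --mem=4gb --pty bash   # creates regular step <jobid>.0
>     $ squeue -s                             # lists running steps; shows <jobid>.0 (bash)
>     $ mpirun -n 4 ~/prime-mpi               # hangs: would need a second, overlapping step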
>
> You could try using the --overlap flag or `export SLURM_OVERLAP=1´
> before running your interactive job to revert to the previous behavior
> that allows steps to overlap.
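>
> A minimal sketch of that workaround, reusing the flags and binary from
> your example:
>
>     $ export SLURM_OVERLAP=1
>     $ srun -N 2 -n 4 --mem=4gb --pty bash
>     $ mpirun -n 4 ~/prime-mpi    # the second step may now overlap with the interactive one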
>
> However, instead of using `srun --pty bash´ to launch interactive jobs,
> it is now recommended to use `salloc´ and to have
> `LaunchParameters=use_interactive_step´ set in slurm.conf.
>
> `salloc´ with `LaunchParameters=use_interactive_step´ enabled will
> create a special interactive step (step <jobid>.interactive) that does not
> consume any resources and, thus, does not interfere with a new job step
> launched from within this special interactive job step.
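>
> A sketch of that setup (the slurm.conf line and the resource flags
> mirror what has already been mentioned in this thread):
>
>     # in slurm.conf
>     LaunchParameters=use_interactive_step
>
>     $ salloc -N 2 -n 4 --mem=4gb
>     $ srun -n 4 ~/prime-mpi    # runs as step <jobid>.0; the shell sits in step <jobid>.interactive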
>
> Hope this helps.
>
> Best regards
> Jürgen
>
>
> * Em Dragowsky <dragowsky at case.edu> [221102 15:46]:
> > Greetings --
> >
> > When we started using Slurm some years ago, obtaining interactive
> > resources through "srun ... --pty bash" was the standard that we
> > adopted.  We are now running Slurm v22.05 (happily), though we recently
> > noticed some limitations when claiming resources to demonstrate or
> > develop in an MPI environment.  A colleague today was revisiting a
> > finding dating back to January, which is:
> >
> > I am having issues running interactive MPI jobs in a traditional way. It
> > > just stays there without execution.
> > >
> > > srun -N 2 -n 4 --mem=4gb --pty bash
> > > mpirun -n 4 ~/prime-mpi
> > >
> > > However, it does run with:
> > > srun -N 2 -n 4 --mem=4gb  ~/prime-mpi
> > >
> >
> > As indicated, the first approach, claiming resources to test/demo MPI
> > jobs via "srun ... --pty bash", no longer supports launching the job.
> > We also ran srun with added verbosity and found that the job steps
> > execute and terminate before the prompt appears in the requested shell.
> >
> > While we infer that changes were implemented, would someone be able to
> > direct us to documentation or a discussion of the changes and the
> > motivation behind them?  We do not doubt that there is compelling
> > motivation; we ask in order to improve our understanding.  As was
> > summarized and shared amongst our team following our review of the
> > current operational behaviour:
> >
> > >
> > >    - "srun ... executable" works fine
> > >    - "salloc -n4", "ssh <node>", "srun -n4 <executable>" works
> > >    Using "mpirun -n4 <executable>" does not work
> > >    - In batch mode, both mpirun and srun work.
> > >
> > >
> > Thanks to any and all who take the time to shed light on this matter.
> >
> >
> > --
> > E.M. (Em) Dragowsky, Ph.D.
> > Research Computing -- UTech
> > Case Western Reserve University
> > (216) 368-0082
> > they/them
>
>

-- 
E.M. (Em) Dragowsky, Ph.D.
Research Computing -- UTech
Case Western Reserve University
(216) 368-0082
they/them