[slurm-users] Interactive jobs using "srun --pty bash" and MPI
Juergen Salk
juergen.salk at uni-ulm.de
Thu Nov 3 19:45:18 UTC 2022
Hi Em,
this was somehow mentioned in the 20.11.0 Release Notes:
https://github.com/SchedMD/slurm/blob/slurm-20-11-0-1/RELEASE_NOTES
-- By default a step started with srun will get --exclusive behavior meaning
no other parallel step will be allowed to run on the same resources at
the same time. To get the previous default behavior which allowed
parallel steps to share all resources use the new srun '--overlap' option.
[...]
-- Remove SallocDefaultCommand option.
-- Add support for an "Interactive Step", designed to be used with salloc to
launch a terminal on an allocated compute node automatically. Enable by
setting "use_interactive_step" as part of LaunchParameters.
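
In case it helps, here is a minimal sketch for checking whether a cluster already has the interactive step enabled. It assumes that `scontrol show config` prints a line of the form `LaunchParameters = ...`; the helper name `has_interactive_step` is made up for illustration and is not part of Slurm.

```shell
# Hypothetical helper: reads "scontrol show config"-style text on stdin and
# succeeds if the LaunchParameters line contains use_interactive_step.
has_interactive_step() {
    grep -Eiq '^LaunchParameters[[:space:]]*=.*use_interactive_step'
}

# Only query the controller if scontrol is actually available on this host.
if command -v scontrol >/dev/null 2>&1; then
    if scontrol show config | has_interactive_step; then
        echo "interactive step enabled: plain 'salloc' opens a shell on a compute node"
    else
        echo "not enabled: an admin would need LaunchParameters=use_interactive_step in slurm.conf"
    fi
fi
```

With the option set, something like `salloc -N 2 -n 4` should place you in step <jobid>.interactive, from which mpirun or srun can start regular steps.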
I think the release notes could have been a bit clearer here and I also didn't
realize the implications until I ran into exactly the same problem with interactive
jobs as you did.
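
If you want to keep the old `srun --pty bash` workflow for now, a rough sketch of the overlap workaround could look like this. The resource options are simply copied from the example in your mail, and the function name `interactive_mpi_shell` is hypothetical; I have not tested this on your setup.

```shell
# Let job steps started from within the interactive step share its resources.
# SLURM_OVERLAP=1 is the environment equivalent of srun's --overlap option
# (Slurm 20.11 and later).
export SLURM_OVERLAP=1

# Hypothetical wrapper: starts the interactive shell with --overlap set
# explicitly. Nothing is launched until the function is called.
interactive_mpi_shell() {
    srun --overlap -N 2 -n 4 --mem=4gb --pty bash
}

# Usage on a login node:
#   interactive_mpi_shell
# and then, inside the resulting shell:
#   mpirun -n 4 ~/prime-mpi
```

This only restores the pre-20.11 behaviour; the salloc approach avoids the overlap question altogether.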
Best regards
Jürgen
* Em Dragowsky <dragowsky at case.edu> [221103 11:02]:
> Hi, Juergen --
>
> This is really useful information -- thanks for the pointer, and for taking
> the time to share!
>
> And, Jacob -- can you point us to any primary documentation based on
> Juergen's observation that the change took place with v20.11?
>
> With the emphasis on salloc, I find in the examples:
>
> > To get an allocation, and open a new xterm in which srun commands
> > may be typed interactively:
> >
> > $ salloc -N16 xterm
> > salloc: Granted job allocation 65537
> >
>
> which works as advertised (I'm not sure whether I miss xterms or not -- at
> least on our cluster we don't configure them explicitly as a primary
> terminal tool)
>
> And thanks also Chris and Jason for the validation and endorsement of these
> approaches.
>
> Best, all!
> ~ Em
>
> On Wed, Nov 2, 2022 at 5:47 PM Juergen Salk <juergen.salk at uni-ulm.de> wrote:
>
> > Hi Em,
> >
> > this is most probably because in Slurm version 20.11 the behaviour of srun was
> > changed to not allow job steps to overlap by default any more.
> >
> > An interactive job launched by `srun --pty bash´ always creates a regular
> > step (step <jobid>.0), so mpirun or srun will hang when trying to launch another
> > job step from within this interactive job step as they would overlap.
> >
> > You could try using the --overlap flag or `export SLURM_OVERLAP=1´
> > before running your interactive job to revert to the previous behavior
> > that allows steps to overlap.
> >
> > However, instead of using `srun --pty bash´ for launching interactive jobs, it
> > is now recommended to use `salloc´ and have `LaunchParameters=use_interactive_step´
> > set in slurm.conf.
> >
> > `salloc´ with `LaunchParameters=use_interactive_step´ enabled will
> > create a special interactive step (step <jobid>.interactive) that does not
> > consume any resources and, thus, does not interfere with a new job step
> > launched from within this special interactive job step.
> >
> > Hope this helps.
> >
> > Best regards
> > Jürgen
> >
> >
> > * Em Dragowsky <dragowsky at case.edu> [221102 15:46]:
> > > Greetings --
> > >
> > > When we started using Slurm some years ago, obtaining the interactive
> > > resources through "srun ... --pty bash" was the standard that we adopted.
> > > We are now running Slurm v22.05 (happily), though we noticed recently some
> > > limitations when claiming resources to demonstrate or develop in an mpi
> > > environment. A colleague today was revisiting a finding dating back to
> > > January, which is:
> > >
> > > > I am having issues running interactive MPI jobs in a traditional way. It
> > > > just stays there without execution.
> > > >
> > > > srun -N 2 -n 4 --mem=4gb --pty bash
> > > > mpirun -n 4 ~/prime-mpi
> > > >
> > > > However, it does run with:
> > > > srun -N 2 -n 4 --mem=4gb ~/prime-mpi
> > > >
> > >
> > > As indicated, the first approach, taking the resources to test/demo MPI
> > > jobs via "srun ... --pty bash" no longer supports the launching of the
> > > job. We also checked the srun environment using verbosity, and found that
> > > the job steps are executed and terminate before the prompt is achieved in
> > > the requested shell.
> > >
> > > While we infer that changes were implemented, would someone be able to
> > > direct us to documentation or a discussion as to the changes, and the
> > > motivation? We do not doubt that there is compelling motivation, we ask to
> > > improve our understanding. As was summarized in and shared amongst our
> > > team following our review of the current operational behaviour:
> > >
> > > >
> > > > - "srun ... executable" works fine
> > > > - "salloc -n4", "ssh <node>", "srun -n4 <executable>" works
> > > > Using "mpirun -n4 <executable>" does not work
> > > > - In batch mode, both mpirun and srun work.
> > > >
> > > >
> > > Thanks to any and all who take the time to shed light on this matter.
> > >
> > >
> > > --
> > > E.M. (Em) Dragowsky, Ph.D.
> > > Research Computing -- UTech
> > > Case Western Reserve University
> > > (216) 368-0082
> > > they/them
> >
> >
>
--
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471