[slurm-users] [EXTERNAL] --no-alloc breaks mpi?

Pritchard Jr., Howard howardp at lanl.gov
Mon Mar 8 21:35:01 UTC 2021

Hi Chris,

My speculation (since I don’t have permission to use --no-alloc myself) is that SLURM_JOBID is not set but SLURM_NODELIST may be, and that combination is confusing ORTE.
Could you list which SLURM environment variables are set in the shell in which you’re running the srun command?
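
If it helps, here is a minimal sketch of a script that collects this. The file name slurm_env_check.py and the helper names slurm_env and report are made up for illustration, and it avoids f-strings on the assumption that it has to run under the Python 2.7 in your conda environment. It only reads the environment and flags whether the two variables mentioned above are present.

```python
import os


def slurm_env(environ=None):
    """Return sorted (name, value) pairs for every SLURM_* variable."""
    env = os.environ if environ is None else environ
    return sorted((k, v) for k, v in env.items() if k.startswith("SLURM_"))


def report(environ=None):
    """Build printable lines, flagging the two variables of interest."""
    pairs = slurm_env(environ)
    names = set(k for k, _ in pairs)
    lines = ["{}={}".format(k, v) for k, v in pairs]
    # These are the two variables speculated about above.
    for name in ("SLURM_JOBID", "SLURM_NODELIST"):
        if name not in names:
            lines.append("{} is not set".format(name))
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

Running it directly in the shell (python slurm_env_check.py) shows what srun inherits. Putting it in place of mpi_simpletest.py in your two srun command lines, with and without --no-alloc, should show what the task itself sees in each case.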


From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of "O'Grady, Paul Christopher" <cpo at slac.stanford.edu>
Reply-To: Slurm User Community List <slurm-users at lists.schedmd.com>
Date: Monday, March 8, 2021 at 2:09 PM
To: "slurm-users at lists.schedmd.com" <slurm-users at lists.schedmd.com>
Subject: [EXTERNAL] [slurm-users] --no-alloc breaks mpi?


I’m having an issue with srun's --no-alloc flag and MPI, which I can reproduce with a fairly simple example.  When I run a simple one-core MPI test program as “slurmUser” (the account that has the --no-alloc privilege), it succeeds:

srun -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py

However, when I add the --no-alloc flag, it fails in a way that appears to break MPI (see the logfile output and other Slurm/MPI info below).  It fails similarly on 2 cores.

srun --no-alloc -p psfehq -n 1 -o logs/test.log -w psana1507 python ~/ipsana/mpi_simpletest.py
srun: do not allocate resources
srun: error: psana1507: task 0: Exited with exit code 1

Would anyone have any suggestions for how I could make the “--no-alloc” flag work with MPI?  Thanks!



Logfile error with --no-alloc flag:

(ana-4.0.12) psanagpu105:batchtest_slurm$ more logs/test.log
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM support. This usually happens
when OMPI was not configured --with-slurm and we weren't able
to discover a SLURM installation in the usual places.

Please configure as appropriate and try again.
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[psana1507:13884] Local abort before MPI_INIT completed completed successfully,
but am not able to aggregate error messages, and not able to guarantee that all
other processes were killed!
(ana-4.0.12) psanagpu105:batchtest_slurm$

System information:

(ana-4.0.12) psanagpu105:batchtest_slurm$ conda list | grep mpi
mpi                       1.0                     openmpi    conda-forge
mpi4py                    3.0.3            py27h9ab638b_1    conda-forge
openmpi                   4.1.0                h9b22176_1    conda-forge

(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
(ana-4.0.12) psanagpu105:batchtest_slurm$ srun --version
slurm 20.11.3
(ana-4.0.12) psanagpu105:batchtest_slurm$
