[slurm-users] srun and Intel MPI 2020 Update 4

Malte Thoma Malte.Thoma at awi.de
Fri Nov 6 08:30:36 UTC 2020


Hi Ciaron,

On our Omni-Path network, we encountered a similar problem:

The MPI needs exclusive access to the interconnect.

Cray once provided a workaround, but it was not worth implementing (a terrible effort/gain ratio for us).

Conclusion: you might have to live with this limitation.
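
A minimal sketch of what that looks like in a batch script (illustrative only; it simply asks Slurm for whole nodes):

   #SBATCH --exclusive    # sole use of the allocated nodes, and with them the interconnect

or, equivalently, "sbatch --exclusive" on the command line.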

Kind regards,
Malte



On 05.11.20 at 16:41, Ciaron Linstead wrote:
> Hello all
> 
> I've been trying to run a simple MPI application (the Intel MPI Benchmark) using the latest Intel Parallel Studio (2020 Update 4) 
> and srun. Version 2019 Update 4 runs this example correctly, as does mpirun.
> 
> SLURM is 17.11.7
> 
> The error I get is the following, unless I use --exclusive:
> 
> 
> MPI startup(): Could not import some environment variables. Intel MPI process pinning will not be used.
>                 Possible reason: Using the Slurm srun command. In this case, Slurm pinning will be used.
> MPIR_pmi_virtualization(): MPI startup(): PMI calls are forwarded to /p/system/slurm/lib/libpmi.so
> Abort(2664079) on node 19 (rank 19 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
> MPIR_Init_thread(136).........:
> MPID_Init(1127)...............:
> MPIDI_SHMI_mpi_init_hook(29)..:
> MPIDI_POSIX_mpi_init_hook(141):
> MPIDI_POSIX_eager_init(2109)..:
> MPIDU_shm_seg_commit(296).....: unable to allocate shared memory
> 
> 
> 
> I have a ticket open with Intel, who suggested increasing /dev/shm on the nodes to 64GB (the size of the RAM on the nodes), but this 
> had no effect.
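> 
> For reference, /dev/shm is a tmpfs, so checking and resizing it is standard mount handling; a sketch (the 64G value is just the size Intel suggested):
> 
> df -h /dev/shm                       # show current size and usage of the shared-memory tmpfs
> mount -o remount,size=64G /dev/shm   # grow it (as root) until the next reboot;
>                                      # a permanent change goes into the tmpfs line in /etc/fstab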
> 
> Here's my submit script:
> 
> 
> 
> #!/bin/bash
> 
> #SBATCH --ntasks=25 # fails, unless exclusive
> ##SBATCH --exclusive
> 
> source /p/system/packages/intel/parallel_studio_xe_2020_update4/impi/2019.9.304/intel64/bin/mpivars.sh -ofi_internal=1
> 
> export I_MPI_FABRICS=shm:ofi
> export FI_PROVIDER=verbs
> export FI_VERBS_IFACE=ib0
> export FI_LOG_LEVEL=trace
> 
> export I_MPI_PMI_LIBRARY=/p/system/slurm/lib/libpmi.so
> export I_MPI_DEBUG=5
> 
> # Fails for any MPI program, not just this one
> srun -v -n $SLURM_NTASKS /home/linstead/imb_2019.5/IMB-MPI1 barrier
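> 
> # (Not part of the original script; a sketch of one diagnostic step: list the
> # PMI plugins this Slurm build offers and, if pmi2 is among them, request it
> # explicitly for the step.)
> # srun --mpi=list
> # srun --mpi=pmi2 -v -n $SLURM_NTASKS /home/linstead/imb_2019.5/IMB-MPI1 barrier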
> 
> 
> 
> Do you have any ideas about where/how to investigate this further?
> 
> Many thanks
> 
> Ciaron
> 

-- 
Malte Thoma        Tel. +49-471-4831-1828
HSM Documentation: https://spaces.awi.de/x/YF3-Eg (User)
                    https://spaces.awi.de/x/oYD8B  (Admin)
HPC Documentation: https://spaces.awi.de/x/Z13-Eg (User)
                    https://spaces.awi.de/x/EgCZB (Admin)
AWI, Geb.E (3125)
Am Handelshafen 12
27570 Bremerhaven


