[slurm-users] Submitting jobs across multiple nodes fails
    Andrej Prsa
    aprsa09 at gmail.com
    Fri Feb  5 00:55:39 UTC 2021

Hi Brian,
Thanks for your response!
> Did you compile slurm with mpi support?
>
Yep:
andrej at terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4
> Your mpi libraries should be the same as that version and they should 
> be available in the same locations for all nodes. Also, ensure they 
> are accessible (PATH, LD_LIBRARY_PATH, etc are set)
>
They are: I have openmpi-4.1.0 installed cluster-wide. Running jobs via
rsh across multiple nodes works just fine, but running them through
slurm (from within an salloc allocation) does not:
mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py # works
mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py # doesn't work
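For context, a minimal test script of this sort (a sketch assuming
mpi4py is installed; testmpi.py itself may differ in detail) looks like:

# minimal MPI sanity check (illustrative sketch, assumes mpi4py)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()            # rank of this process
size = comm.Get_size()            # total number of ranks launched
node = MPI.Get_processor_name()   # hostname running this rank

# with -np 384 over four nodes, expect 384 lines spread across node15..node18
print(f"rank {rank} of {size} on {node}")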
Thus, I believe that mpi works just fine. I ran this by the ompi-devel
folks and they are convinced that the issue is in the slurm
configuration. I'm trying to figure out what's causing this error to
pop up in the logs:
mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
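Error 98 is EADDRINUSE, which for a UNIX socket usually means the
socket file already exists on disk. One thing worth checking (a
hypothesis on my end, not a confirmed fix) is whether a stale socket
from an earlier step is still sitting in the slurmd spool directory:

ls -l /var/spool/slurmd/stepd.slurm.pmix.*   # any leftovers from earlier jobs/steps?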
I wonder if the culprit is how srun interacts with openmpi's --bind-to.
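As a cross-check (just an experiment, not a known fix), launching the
script directly through srun's pmix plugin, e.g.

srun --mpi=pmix -n 384 python testmpi.py

inside the same allocation would bypass mpirun's plm layer entirely.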
Thanks again,
Andrej
    
    