[slurm-users] Submitting jobs across multiple nodes fails

Brian Andrus toomuchit at gmail.com
Fri Feb 5 02:34:12 UTC 2021


try:

export SLURM_OVERLAP=1
export SLURM_WHOLE=1

before your salloc and see if that helps. I have seen some MPI issues 
that were resolved by that.
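
For example, with those two variables exported, a minimal sketch (the 
4 x 96 layout just mirrors the node list in your message; adjust to 
your allocation):

salloc -N 4 --ntasks-per-node=96
srun --mpi=pmix python testmpi.py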

You can also try running it with plain mpirun on the allocated nodes. 
That will give you another data point.
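
For example, from inside the allocation (hostnames taken from your 
message below; plain mpirun, no -mca plm override):

mpirun -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py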

Brian Andrus

On 2/4/2021 4:55 PM, Andrej Prsa wrote:
> Hi Brian,
>
> Thanks for your response!
>
>> Did you compile slurm with mpi support?
>>
>
> Yep:
>
> andrej at terra:~$ srun --mpi=list
> srun: MPI types are...
> srun: cray_shasta
> srun: none
> srun: pmi2
> srun: pmix
> srun: pmix_v4
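>
> As a side note, any plugin from that list can also be forced
> explicitly, e.g. (a sketch using pmix_v4 from above):
>
> srun --mpi=pmix_v4 python testmpi.py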
>
>> Your mpi libraries should be the same as that version and they should 
>> be available in the same locations for all nodes. Also, ensure they 
>> are accessible (PATH, LD_LIBRARY_PATH, etc are set)
>>
>
> They are: I have openmpi-4.1.0 installed cluster-wide. Running jobs 
> via rsh across multiple nodes works just fine, but running them 
> through slurm (within salloc) does not:
>
> mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py   # works
> mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py   # doesn't work
>
> Thus, I believe that MPI itself works just fine. I ran this by the 
> ompi-devel folks and they are convinced that the issue is in the 
> slurm configuration. I'm trying to figure out what's causing this 
> error to pop up in the logs:
>
> mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
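>
> One thing I still need to rule out (a sketch; assuming
> /var/spool/slurmd above is the slurmd spool dir on every node) is a
> stale socket left behind by a previous step:
>
> # on each compute node
> sudo ls -l /var/spool/slurmd/ | grep pmix        # look for leftover stepd sockets
> sudo rm -f /var/spool/slurmd/stepd.slurm.pmix.*  # clear them, then retry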
>
> I wonder if the culprit is how srun calls openmpi's --bind-to?
>
> Thanks again,
> Andrej
>
>


