[slurm-users] Submitting jobs across multiple nodes fails
Andrej Prsa
aprsa09 at gmail.com
Tue Feb 2 05:14:36 UTC 2021
Dear list,
I'm struggling with an issue that seems very similar to the one in this thread:
https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html
I'm using Slurm 20.11.3, patched with this fix to detect PMIx v4:
https://bugs.schedmd.com/show_bug.cgi?id=10683
and this is what I'm seeing:
andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
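For completeness: pmix is definitely built in and loading (the mpi/pmix errors below come from the plugin itself). The MPI plugin types srun knows about can be listed with

andrej@terra:~$ srun --mpi=list

and pmix shows up there as expected after the patch above.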
In slurmctld.log I have this:
[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841 NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for 0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for 0x55f568e00cb0s on node10
and in slurmd.log I have this for node9:
[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108 [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job
and this for node10:
[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job
The culprit seems to be the bind() failure, but I can't make much sense of it. I have checked that /etc/hosts is correct and consistent with the node information in slurm.conf.
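Since bind() reports "Address already in use" (errno 98), the next thing I plan to check is whether something is already sitting at that socket path on the nodes, and what filesystem the spool directory actually lives on; standard commands, nothing Slurm-specific:

andrej@terra:~$ ssh node9 ls -l /var/spool/slurmd/
andrej@terra:~$ ssh node9 stat -f -c %T /var/spool/slurmd

(the second command prints the filesystem type, e.g. tmpfs vs. nfs).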
Other potentially relevant info: all compute nodes are diskless; they are PXE-booted from a NAS image and run Ubuntu Server 20.04. Jobs confined to a single node run fine.
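One hypothesis I can't rule out yet, given the diskless setup: if /var/spool/slurmd ends up on the shared NAS image rather than on node-local storage, node9 and node10 would both be trying to create the same socket file /var/spool/slurmd/stepd.slurm.pmix.841.0 on the same underlying filesystem, which would explain the "Address already in use". If that turns out to be the case, I'd try making the spool dir node-local by mounting a tmpfs over it in the image, e.g. with an /etc/fstab entry like this (untested sketch, size picked arbitrarily):

tmpfs  /var/spool/slurmd  tmpfs  defaults,size=64m  0  0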
Thanks for any insight and suggestions.
Cheers,
Andrej