[slurm-users] Submitting jobs across multiple nodes fails
Andrej Prsa
aprsa09 at gmail.com
Tue Feb 2 05:14:36 UTC 2021
Dear list,
I'm struggling with an issue that seems very similar to the one in this thread:
https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html
I'm using Slurm 20.11.3, patched with this fix to detect PMIx v4:
https://bugs.schedmd.com/show_bug.cgi?id=10683
and this is what I'm seeing:
andrej@terra:~$ salloc -N 2 -n 2
salloc: Granted job allocation 841
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 0 launch failed: Unspecified error
srun: error: task 1 launch failed: Unspecified error
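For completeness: pmix is definitely built in and loading (the mpi/pmix errors below come from the plugin itself). The MPI plugin types srun knows about can be listed with

andrej@terra:~$ srun --mpi=list

and pmix shows up there as expected after the patch above.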
In slurmctld.log I have this:
[2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841 NodeList=node[9-10] usec=572
[2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for 0x557e7480bcb0s on node9
[2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for 0x55f568e00cb0s on node10
and in slurmd.log I have this for node9:
[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:35508
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108 [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
[2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.892] [841.0] done with job
and this for node10:
[2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000 GID:1000 HOST:192.168.1.1 PORT:38918
[2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841 implicit auto binding: cores, dist 1
[2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic
[2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
[2021-02-01T23:58:19.825] [841.0] error: node10 [1] pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init failed with error -2: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518 [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1: Success (0)
[2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423 [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
[2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
[2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally, rc = -1
[2021-02-01T23:58:19.899] [841.0] done with job
The culprit seems to be the bind() failure, but I can't make much sense of it. I have checked that /etc/hosts is correct and consistent with the node information in slurm.conf.
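Since bind() reports "Address already in use" (errno 98), the next thing I plan to check is whether something is already sitting at that socket path on the nodes, and what filesystem the spool directory actually lives on; standard commands, nothing Slurm-specific:

andrej@terra:~$ ssh node9 ls -l /var/spool/slurmd/
andrej@terra:~$ ssh node9 stat -f -c %T /var/spool/slurmd

(the second command prints the filesystem type, e.g. tmpfs vs. nfs).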
Other potentially relevant info: all compute nodes are diskless; they are PXE-booted from a NAS image and run Ubuntu Server 20.04. Jobs confined to a single node run fine.
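One hypothesis I can't rule out yet, given the diskless setup: if /var/spool/slurmd ends up on the shared NAS image rather than on node-local storage, node9 and node10 would both be trying to create the same socket file /var/spool/slurmd/stepd.slurm.pmix.841.0 on the same underlying filesystem, which would explain the "Address already in use". If that turns out to be the case, I'd try making the spool dir node-local by mounting a tmpfs over it in the image, e.g. with an /etc/fstab entry like this (untested sketch, size picked arbitrarily):

tmpfs  /var/spool/slurmd  tmpfs  defaults,size=64m  0  0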
Thanks for any insight and suggestions.
Cheers,
Andrej