[slurm-users] Submitting jobs across multiple nodes fails
Andrej Prsa
aprsa09 at gmail.com
Thu Feb 4 21:20:49 UTC 2021
Gentle bump on this, in case anyone has suggestions while I weed through
the scattered Slurm docs. :)
Thanks,
Andrej
On February 2, 2021 00:14:37 Andrej Prsa <aprsa09 at gmail.com> wrote:
> Dear list,
>
> I'm struggling with an issue that seems very similar to the one in this
> thread:
>
> https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html
>
> I'm using Slurm 20.11.3, patched with this fix so that PMIx v4 is detected:
>
> https://bugs.schedmd.com/show_bug.cgi?id=10683
>
> and this is what I'm seeing:
>
> andrej at terra:~$ salloc -N 2 -n 2
> salloc: Granted job allocation 841
> andrej at terra:~$ srun hostname
> srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before
> step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: task 0 launch failed: Unspecified error
> srun: error: task 1 launch failed: Unspecified error
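>
> If it helps narrow things down, I can also check which MPI plugins are
> available and whether a plain step launches with the MPI plugin forced
> off. I'm assuming MpiDefault=pmix here, since srun goes through
> mpi/pmix for a bare hostname:
>
>     # list the MPI plugin types srun knows about
>     srun --mpi=list
>     # inside the same allocation: launch the step without any MPI plugin
>     srun --mpi=none hostname
>
> If the --mpi=none step runs cleanly, the problem should be contained to
> the pmix plugin rather than step launch in general.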
>
> In slurmctld.log I have this:
>
> [2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841
> NodeList=node[9-10] usec=572
> [2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for
> 0x557e7480bcb0s on node9
> [2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for
> 0x55f568e00cb0s on node10
>
> and in slurmd.log I have this for node9:
>
> [2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
> GID:1000 HOST:192.168.1.1 PORT:35508
> [2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
> implicit auto binding: cores, dist 1
> [2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
> _task_layout_lllp_cyclic
> [2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
> _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
> [2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108
> [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket
> /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
> [2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387
> [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> [2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
> [2021-02-01T23:58:19.817] [841.0] error: Failed mpi_hook_slurmstepd_prefork
> [2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally,
> rc = -1
> [2021-02-01T23:58:19.892] [841.0] done with job
>
> and this for node10:
>
> [2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
> GID:1000 HOST:192.168.1.1 PORT:38918
> [2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
> implicit auto binding: cores, dist 1
> [2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
> _task_layout_lllp_cyclic
> [2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
> _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
> [2021-02-01T23:58:19.825] [841.0] error: node10 [1]
> pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init
> failed with error -2
> : Success (0)
> [2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518
> [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with error -1
> : Success (0)
> [2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423
> [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
> [2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
> [2021-02-01T23:58:19.829] [841.0] error: Failed mpi_hook_slurmstepd_prefork
> [2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally,
> rc = -1
> [2021-02-01T23:58:19.899] [841.0] done with job
>
> It seems that the culprit is the bind() failure, but I can't make much
> sense of it. I checked that /etc/hosts is correct and consistent with
> the node information in slurm.conf.
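>
> The socket path in the error looks like it lives under SlurmdSpoolDir,
> so the next thing I plan to do is look on node9 for a leftover socket
> from an earlier step and see whether anything still holds it, roughly:
>
>     # on node9, as root; path taken from the bind() error above
>     ls -l /var/spool/slurmd/ | grep pmix
>     # any UNIX sockets still listening with that name?
>     ss -xl | grep stepd.slurm.pmix
>
> If a stale stepd.slurm.pmix.* file is left behind, removing it before
> resubmitting should tell me whether this is just a cleanup problem.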
>
> Other potentially relevant info: all compute nodes are diskless; they
> are PXE-booted from a NAS image and run Ubuntu Server 20.04. Running
> jobs on a single node works fine.
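>
> Because the nodes boot from a shared image, one thing I still want to
> rule out is that /var/spool/slurmd (the directory from the bind() error
> above) ends up on shared or read-only storage instead of a node-local
> filesystem. Something like this on each node should show where it
> really lives and whether it is writable:
>
>     # which filesystem backs the slurmd spool directory?
>     findmnt -T /var/spool/slurmd
>     # quick write test
>     touch /var/spool/slurmd/writetest && rm /var/spool/slurmd/writetest
>
> If both nodes were writing into the same underlying directory, the two
> slurmstepds would be racing to bind the same socket path, which could
> explain the "Address already in use".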
>
> Thanks for any insight and suggestions.
>
> Cheers,
> Andrej