[slurm-users] Submitting jobs across multiple nodes fails
Brian Andrus
toomuchit at gmail.com
Thu Feb 4 22:46:36 UTC 2021
Did you compile Slurm with MPI (PMIx) support?
Your MPI libraries should match the version Slurm was built against, and they
should be available in the same locations on all nodes.
Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc. are set).
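
A quick way to sanity-check that on every node (the plugin path below is only
an example; adjust it to whatever your PluginDir is):

    srun --mpi=list                        # should list pmix / pmix_v4
    scontrol show config | grep -i -e MpiDefault -e PluginDir
    ls /usr/lib/slurm/mpi_pmix*.so         # is the pmix plugin actually installed?
    ldd /usr/lib/slurm/mpi_pmix_v4.so | grep -i pmix   # same libpmix resolved on all nodes?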
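
Also, about the "Address already in use" bind() error in your node9 log below:
since the nodes are diskless and boot from a shared image, it may be worth
ruling out that SlurmdSpoolDir points at storage shared between nodes, or that
a stale PMIx socket is left behind from an earlier step. A rough check on each
node (paths taken from your error message):

    scontrol show config | grep -i SlurmdSpoolDir
    df -hT /var/spool/slurmd                      # nfs/overlay here would mean the path is shared
    ls -l /var/spool/slurmd/stepd.slurm.pmix.*    # stale sockets from previous job steps?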
Brian Andrus
On 2/4/2021 1:20 PM, Andrej Prsa wrote:
> Gentle bump on this, if anyone has suggestions as I weed through the
> scattered slurm docs. :)
>
> Thanks,
> Andrej
>
> On February 2, 2021 00:14:37 Andrej Prsa <aprsa09 at gmail.com> wrote:
>
>> Dear list,
>>
>> I'm struggling with what seems to be very similar to this thread:
>>
>> https://lists.schedmd.com/pipermail/slurm-users/2019-July/003746.html
>>
>> I'm using slurm 20.11.3 patched with this fix to detect pmixv4:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=10683
>>
>> and this is what I'm seeing:
>>
>> andrej at terra:~$ salloc -N 2 -n 2
>> salloc: Granted job allocation 841
>> andrej at terra:~$ srun hostname
>> srun: launch/slurm: launch_p_step_launch: StepId=841.0 aborted before
>> step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: task 0 launch failed: Unspecified error
>> srun: error: task 1 launch failed: Unspecified error
>>
>> In slurmctld.log I have this:
>>
>> [2021-02-01T23:58:13.683] sched: _slurm_rpc_allocate_resources JobId=841
>> NodeList=node[9-10] usec=572
>> [2021-02-01T23:58:19.817] error: mpi_hook_slurmstepd_prefork failure for
>> 0x557e7480bcb0s on node9
>> [2021-02-01T23:58:19.829] error: mpi_hook_slurmstepd_prefork failure for
>> 0x55f568e00cb0s on node10
>>
>> and in slurmd.log I have this for node9:
>>
>> [2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
>> GID:1000 HOST:192.168.1.1 PORT:35508
>> [2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
>> implicit auto binding: cores, dist 1
>> [2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
>> _task_layout_lllp_cyclic
>> [2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
>> _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
>> [2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_utils.c:108
>> [pmixp_usock_create_srv] mpi/pmix: ERROR: Cannot bind() UNIX socket
>> /var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
>> [2021-02-01T23:58:19.814] [841.0] error: node9 [0] pmixp_server.c:387
>> [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
>> [2021-02-01T23:58:19.814] [841.0] error: (null) [0] mpi_pmix.c:169
>> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init()
>> failed
>> [2021-02-01T23:58:19.817] [841.0] error: Failed
>> mpi_hook_slurmstepd_prefork
>> [2021-02-01T23:58:19.845] [841.0] error: job_manager exiting abnormally,
>> rc = -1
>> [2021-02-01T23:58:19.892] [841.0] done with job
>>
>> and this for node10:
>>
>> [2021-02-01T23:58:19.788] launch task StepId=841.0 request from UID:1000
>> GID:1000 HOST:192.168.1.1 PORT:38918
>> [2021-02-01T23:58:19.789] task/affinity: lllp_distribution: JobId=841
>> implicit auto binding: cores, dist 1
>> [2021-02-01T23:58:19.789] task/affinity: _task_layout_lllp_cyclic:
>> _task_layout_lllp_cyclic
>> [2021-02-01T23:58:19.789] task/affinity: _lllp_generate_cpu_bind:
>> _lllp_generate_cpu_bind jobid [841]: mask_cpu, 0x000000000001000000000001
>> [2021-02-01T23:58:19.825] [841.0] error: node10 [1]
>> pmixp_client_v2.c:246 [pmixp_lib_init] mpi/pmix: ERROR: PMIx_server_init
>> failed with error -2
>> : Success (0)
>> [2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_client.c:518
>> [pmixp_libpmix_init] mpi/pmix: ERROR: PMIx_server_init failed with
>> error -1
>> : Success (0)
>> [2021-02-01T23:58:19.826] [841.0] error: node10 [1] pmixp_server.c:423
>> [pmixp_stepd_init] mpi/pmix: ERROR: pmixp_libpmix_init() failed
>> [2021-02-01T23:58:19.826] [841.0] error: (null) [1] mpi_pmix.c:169
>> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init()
>> failed
>> [2021-02-01T23:58:19.829] [841.0] error: Failed
>> mpi_hook_slurmstepd_prefork
>> [2021-02-01T23:58:19.853] [841.0] error: job_manager exiting abnormally,
>> rc = -1
>> [2021-02-01T23:58:19.899] [841.0] done with job
>>
>> It seems that the culprit is the bind() failure, but I can't make much
>> sense of it. I checked that /etc/hosts has everything correct and
>> consistent with the info in slurm.conf.
>>
>> Other potentially relevant info: all compute nodes are diskless; they are
>> PXE-booted from a NAS image and run Ubuntu Server 20.04. Running jobs on a
>> single node works fine.
>>
>> Thanks for any insight and suggestions.
>>
>> Cheers,
>> Andrej
>