[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05
Andrés Marín Díaz
amarin at cesvima.upm.es
Thu Jun 6 13:58:27 UTC 2019
Hello,
We have tried to compile it in two ways. Initially we compiled it with
PMIx as follows:
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix --with-pmix=/opt/pmix/3.1.2'
We have also tried compiling it without PMIx:
rpmbuild -ta slurm-19.05.0.tar.bz2
In both cases the result is the same.
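In case it is useful, the set of MPI plugins that actually got built can be listed directly; the plugin directory below is the default RPM install location and may differ on other systems:

```shell
# List the MPI plugins installed by the Slurm RPMs; if the PMIx build
# worked, an mpi_pmix*.so (e.g. mpi_pmix_v3.so) should appear here.
# /usr/lib64/slurm is the default RPM plugin directory; adjust if needed.
ls /usr/lib64/slurm/mpi_*.so
```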
In the slurm.conf we have defined:
MpiDefault=pmi2
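The plugin actually in effect can also be checked, and overridden per job, without touching slurm.conf; the job binary below is just a placeholder:

```shell
# Ask srun which MPI plugin types this Slurm build supports; pmix
# should be listed if the PMIx plugin was compiled in.
srun --mpi=list

# Request a specific plugin for one run, overriding MpiDefault
# (./mpi_hello is a placeholder MPI binary).
srun --mpi=pmix -n 2 ./mpi_hello
```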
Thank you,
Best regards.
On 06/06/2019 at 13:27, Sean Crosby wrote:
> How did you compile SLURM? Did you add the contribs/pmi and/or
> contribs/pmi2 plugins to the install? Or did you use PMIx?
>
> Sean
>
> --
> Sean Crosby
> Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
> Research Computing | CoEPP | School of Physics
> University of Melbourne
>
>
> On Thu, 6 Jun 2019 at 21:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
>
> Hello,
>
>     Yes, we have recompiled OpenMPI with SLURM 19.05 integration, but
>     the problem remains.
>
>     We have also tried recompiling OpenMPI without SLURM integration.
>     In that case executions fail with srun, but mpirun continues to
>     work under SLURM 18.08 and fails under 19.05 in the same way.
>
> Thank you very much once more.
>
>
>     On 06/06/2019 at 12:46, Sean Crosby wrote:
>> Hi Andrés,
>>
>> Did you recompile OpenMPI after updating to SLURM 19.05?
>>
>> Sean
>>
>>
>> On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
>>
>>         Thank you very much for the help; here is some updated information.
>>
>>         - If we use Intel MPI (IMPI) mpirun, it works correctly.
>>         - If we use mpirun without using the scheduler, it works correctly.
>>         - If we use srun with software compiled with OpenMPI, it works correctly.
>>         - If we use SLURM 18.08.6, it works correctly.
>>         - If we use SLURM 19.05.0 and mpirun inside the sbatch script, then we get the error:
>>         --------------------------------------------------------------------------
>>         An ORTE daemon has unexpectedly failed after launch and before
>>         communicating back to mpirun. This could be caused by a number
>>         of factors, including an inability to create a connection back
>>         to mpirun due to lack of common network interfaces and/or no
>>         route found between them. Please check network connectivity
>>         (including firewalls and network routing requirements).
>>         --------------------------------------------------------------------------
>>
>> Trying to trace the problem:
>>         - mpirun is a binary and cannot be traced with bash -x.
>>         - I've done a "strace mpirun hostname" to see if it helps, but
>>         I am not able to see where the problem may be.
>>
>>         Here is the output from the strace:
>> https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW
>>
>> And here the slurmd log with verbose level 5:
>> Main node (slurmd log):
>>         2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
>>         2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
>>         2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
>>         2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
>>         2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
>>         2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask
>> 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
>>         2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask
>> 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
>>         2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
>>         2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
>>         2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
>>         2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
>>         2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
>>         2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
>>         2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
>>         2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
>>         2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
>>         2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
>>         2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job
>>
>>         Secondary node (slurmd log):
>>         2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
>>         2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
>>         2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask
>> 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
>>         2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
>>         2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
>>         2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job
>>
>> Thank you very much again.
>>
--
Andrés Marín Díaz
Servicio de Infraestructura e Innovación
Universidad Politécnica de Madrid
Centro de Supercomputación y Visualización de Madrid (CeSViMa)
Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
amarin at cesvima.upm.es | tel 910679676
www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima