[slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05

Andrés Marín Díaz amarin at cesvima.upm.es
Thu Jun 6 13:58:27 UTC 2019


Hello,

We have tried to compile it in two ways. Initially we compiled it with 
PMIx support as follows:
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix --with-pmix=/opt/pmix/3.1.2/'

But we have also tried compiling it without pmix:
rpmbuild -ta slurm-19.05.0.tar.bz2

In both cases the result is the same.
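
As a quick sanity check, the MPI plugin types that the installed Slurm 
actually exposes can be listed directly (the plugin directory below is 
only an example path and depends on the install prefix):

    # list the MPI plugin types available to srun (e.g. none, pmi2, pmix)
    srun --mpi=list

    # or check which mpi_*.so plugins the RPMs installed (example path)
    ls /usr/lib64/slurm/mpi_*.so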

In slurm.conf we have defined:
MpiDefault=pmi2
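
For illustration, the step-launch plugin can be chosen either globally in 
slurm.conf or per job; whether pmi2 or pmix is the right value depends on 
how the MPI library itself was built, so this is only a sketch (and 
./my_mpi_app is a placeholder):

    # slurm.conf: default plugin used for srun-launched steps
    MpiDefault=pmi2

    # per-job override; pmix only works with a PMIx-enabled Slurm build
    srun --mpi=pmix ./my_mpi_app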

Thank you,
Best regards.


On 06/06/2019 at 13:27, Sean Crosby wrote:
> How did you compile SLURM? Did you add the contribs/pmi and/or 
> contribs/pmi2 plugins to the install? Or did you use PMIx?
>
> Sean
>
> --
> Sean Crosby
> Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
> Research Computing | CoEPP | School of Physics
> University of Melbourne
>
>
> On Thu, 6 Jun 2019 at 21:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
>
>     Hello,
>
>     Yes, we have recompiled OpenMPI with SLURM 19.05 integration, but
>     the problem remains.
>
>     We have also tried recompiling OpenMPI without SLURM integration.
>     In that case executions fail with srun, but with mpirun they still
>     work under SLURM 18.08 and fail under 19.05 in the same way.
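>
>     For reference, a "with SLURM integration" build of OpenMPI means a
>     configure line roughly like this (prefix, version and PMIx path are
>     examples only, not the exact values used here):
>
>         # Open MPI built against Slurm and an external PMIx
>         ./configure --prefix=/opt/openmpi/4.0.1 \
>                     --with-slurm \
>                     --with-pmix=/opt/pmix/3.1.2
>         make -j && make install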
>
>     Thank you very much once more.
>
>
>     On 06/06/2019 at 12:46, Sean Crosby wrote:
>>     Hi Andrés,
>>
>>     Did you recompile OpenMPI after updating to SLURM 19.05?
>>
>>     Sean
>>
>>     --
>>     Sean Crosby
>>     Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
>>     Research Computing | CoEPP | School of Physics
>>     University of Melbourne
>>
>>
>>     On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
>>
>>         Thank you very much for the help; here is some updated information.
>>
>>         - If we use Intel MPI (IMPI) mpirun it works correctly.
>>         - If we use mpirun without using the scheduler it works correctly.
>>         - If we use srun with software compiled with OpenMPI it works correctly.
>>         - If we use SLURM 18.08.6 it works correctly.
>>         - If we use SLURM 19.05.0 and mpirun inside the sbatch script, then
>>         we get the error:
>>         --------------------------------------------------------------------------
>>              An ORTE daemon has unexpectedly failed after launch and before
>>              communicating back to mpirun. This could be caused by a number
>>              of factors, including an inability to create a connection back
>>              to mpirun due to lack of common network interfaces and/or no
>>              route found between them. Please check network connectivity
>>              (including firewalls and network routing requirements).
>>         --------------------------------------------------------------------------
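>>
>>         For context, the "mpirun inside the sbatch script" case that fails
>>         corresponds to a minimal batch script along these lines (binary
>>         name and resource values are placeholders, not the real job):
>>
>>             #!/bin/bash
>>             #SBATCH --job-name=mpirun-test
>>             #SBATCH --nodes=2
>>             #SBATCH --ntasks-per-node=1
>>             #SBATCH --mem=1024M
>>             # mpirun picks the node list up from the Slurm allocation
>>             mpirun ./hello_mpi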
>>
>>         Trying to trace the problem:
>>         - mpirun is a binary and cannot be traced with bash -x.
>>         - I've done a "strace mpirun hostname" to see if it helps, but I
>>         am not able to see where the problem may be.
>>
>>         Here is the output from the strace:
>>         https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW
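>>
>>         A generic way to get more detail out of the launcher itself (an
>>         Open MPI debugging sketch, not output captured here) is to raise
>>         the verbosity of the process-launch framework:
>>
>>             # show how mpirun starts (or fails to start) its remote ORTE daemons
>>             mpirun --mca plm_base_verbose 10 --debug-daemons hostname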
>>
>>         And here is the slurmd log with verbose level 5:
>>         Main node (slurmd log):
>>         2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
>>         2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
>>         2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
>>         2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
>>         2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
>>         2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask
>>         000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
>>         2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask
>>         000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
>>         2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
>>         2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
>>         2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
>>         2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
>>         2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
>>         2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
>>         2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
>>         2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
>>         2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
>>         2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
>>         2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job
>>
>>         Secondary node (slurmd log):
>>         2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
>>         2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
>>         2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask
>>         000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
>>         2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
>>         2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
>>         2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
>>         2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
>>         2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
>>         2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job
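>>
>>         For completeness, the verbose slurmd logging above is controlled
>>         by SlurmdDebug in slurm.conf; one way to raise it, assuming an
>>         otherwise default setup, is:
>>
>>             # slurm.conf
>>             SlurmdDebug=debug5
>>             # then push the new setting out to the daemons
>>             scontrol reconfigure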
>>
>>         Thank you very much again.
>>
-- 
  Andrés Marín Díaz
  
  Servicio de Infraestructura e Innovación
  Universidad Politécnica de Madrid
  
  Centro de Supercomputación y Visualización de Madrid (CeSViMa)
  Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
  amarin at cesvima.upm.es | tel 910679676
  
  www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
   
