[slurm-users] switch to cgroups

Crocker, Deborah crock at ua.edu
Thu Jun 6 15:42:32 UTC 2019


Does anyone know what would happen to running jobs if we switch to cgroups? We missed making this change when we had a general cluster shutdown and want to get it set now, but we do have running jobs at the moment. Thanks
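(For context, "switching to cgroups" typically means changes along these lines in slurm.conf and cgroup.conf. This is a minimal sketch only; the parameters are real Slurm options, but which constraints to enable is site-specific:)

```
# slurm.conf (sketch)
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

# cgroup.conf (sketch) - illustrative, not a recommendation
ConstrainCores=yes
ConstrainRAMSpace=yes
```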

Deborah Crocker, PhD
Systems Engineer III
Office of Information Technology
The University of Alabama
Box 870346
Tuscaloosa, AL 36587
Office 205-348-3758 | Fax 205-348-9393
deborah.crocker at ua.edu

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Andrés Marín Díaz
Sent: Thursday, June 6, 2019 8:58 AM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] Failed to launch jobs with mpirun after upgrading to Slurm 19.05


Hello,

We have tried to compile it in two ways. Initially we compiled it with PMIx as follows:
rpmbuild -ta slurm-19.05.0.tar.bz2 --define '_with_pmix --with-pmix=/opt/pmix/3.1.2'

We have also tried compiling it without PMIx:
rpmbuild -ta slurm-19.05.0.tar.bz2

In both cases the result is the same.

In slurm.conf we have defined:
MpiDefault=pmi2
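(For reference, the default MPI plugin choice can be sketched in slurm.conf like this. The pmix line is hedged: it assumes Slurm was actually built against an external PMIx, as in the rpmbuild command above:)

```
# slurm.conf (sketch) - pick the default MPI plugin type
MpiDefault=pmi2      # legacy PMI-2 interface
#MpiDefault=pmix     # only valid if Slurm was built --with-pmix
```

`srun --mpi=list` prints the plugin types the running installation actually supports, which is a quick way to check what the build produced.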

Thank you,
Best regards.

On 06/06/2019 at 13:27, Sean Crosby wrote:
How did you compile SLURM? Did you add the contribs/pmi and/or contribs/pmi2 plugins to the install? Or did you use PMIx?

Sean

--
Sean Crosby
Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne


On Thu, 6 Jun 2019 at 21:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:

Hello,

Yes, we have recompiled OpenMPI with SLURM 19.05 integration, but the problem remains.

We have also tried recompiling OpenMPI without SLURM integration. In that case runs fail with srun, while with mpirun they still work under SLURM 18.08 and fail under 19.05 in the same way.

Thank you very much once more.


On 06/06/2019 at 12:46, Sean Crosby wrote:
Hi Andrés,

Did you recompile OpenMPI after updating to SLURM 19.05?

Sean

--
Sean Crosby
Senior DevOps/HPC Engineer and HPC Team Lead | Research Platform Services
Research Computing | CoEPP | School of Physics
University of Melbourne


On Thu, 6 Jun 2019 at 20:11, Andrés Marín Díaz <amarin at cesvima.upm.es> wrote:
Thank you very much for the help; here is some updated information.

- If we use Intel MPI (IMPI) mpirun it works correctly.
- If we use mpirun without using the scheduler it works correctly.
- If we use srun with software compiled with OpenMPI it works correctly.
- If we use SLURM 18.08.6 it works correctly.
- If we use SLURM 19.05.0 and mpirun inside the sbatch script then we get the error:
--------------------------------------------------------------------------
     An ORTE daemon has unexpectedly failed after launch and before
     communicating back to mpirun. This could be caused by a number
     of factors, including an inability to create a connection back
     to mpirun due to lack of common network interfaces and/or no
     route found between them. Please check network connectivity
     (including firewalls and network routing requirements).
--------------------------------------------------------------------------

Trying to trace the problem:
- mpirun is a binary and cannot be traced with bash -x.
- I have run "strace mpirun hostname" to see if it helps, but I am not able to see where the problem may be.
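Since mpirun itself cannot be stepped through with bash -x, OpenMPI's own MCA verbosity is often more revealing than strace. A hedged sketch (plm_base_verbose is a standard OpenMPI MCA parameter controlling the process-launch framework's logging; the guard is only there so the snippet degrades gracefully where mpirun is absent):

```shell
#!/bin/sh
# Raise the verbosity of ORTE's process-launch framework (plm) so the point
# of failure during daemon launch is logged; keep going even if the run fails.
if command -v mpirun >/dev/null 2>&1; then
    mpirun --mca plm_base_verbose 10 hostname || true
else
    echo "mpirun not on PATH; run this inside the sbatch script on the cluster"
fi
```

Run inside the failing sbatch script, this should show whether orted is being launched via srun and at which step the daemon dies.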

Here is the output from the strace:
https://cloud.cesvima.upm.es/index.php/s/hWQMkwU5zW7J8RW

And here is the slurmd log at verbose level 5:
Main node (slurmd log):
     2019-06-06T09:51:54.255743+00:00 r1n1 slurmd[108517]: _run_prolog: run job script took usec=7
     2019-06-06T09:51:54.256118+00:00 r1n1 slurmd[108517]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
     2019-06-06T09:51:54.258887+00:00 r1n1 slurmd[108517]: task_p_slurmd_batch_request: 11057
     2019-06-06T09:51:54.259317+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU input mask for node: 0x0000000001
     2019-06-06T09:51:54.259680+00:00 r1n1 slurmd[108517]: task/affinity: job 11057 CPU final HW mask for node: 0x0000000001
     2019-06-06T09:51:54.279614+00:00 r1n1 slurmstepd[108548]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.280312+00:00 r1n1 slurmstepd[108548]: Munge credential signature plugin loaded
     2019-06-06T09:51:54.302921+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.303592+00:00 r1n1 slurmstepd[108548]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.330322+00:00 r1n1 slurmd[108517]: Launching batch job 11057 for UID 2000
     2019-06-06T09:51:54.353196+00:00 r1n1 slurmstepd[108556]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.353899+00:00 r1n1 slurmstepd[108556]: Munge credential signature plugin loaded
     2019-06-06T09:51:54.366478+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.366755+00:00 r1n1 slurmstepd[108556]: task/cgroup: /slurm/uid_2000/job_11057/step_batch: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.393325+00:00 r1n1 slurmstepd[108556]: debug level = 2
     2019-06-06T09:51:54.393754+00:00 r1n1 slurmstepd[108556]: starting 1 tasks
     2019-06-06T09:51:54.401243+00:00 r1n1 slurmstepd[108556]: task 0 (108561) started 2019-06-06T09:51:54
     2019-06-06T09:51:54.416396+00:00 r1n1 slurmstepd[108561]: task_p_pre_launch: Using sched_affinity for tasks
     2019-06-06T09:51:56.514908+00:00 r1n1 slurmstepd[108556]: task 0 (108561) exited with exit code 1.
     2019-06-06T09:51:56.554430+00:00 r1n1 slurmstepd[108556]: job 11057 completed with slurm_rc = 0, job_rc = 256
     2019-06-06T09:51:56.554847+00:00 r1n1 slurmstepd[108556]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
     2019-06-06T09:51:56.559856+00:00 r1n1 slurmstepd[108556]: done with job
     2019-06-06T09:51:56.596762+00:00 r1n1 slurmstepd[108548]: Sent signal 18 to 11057.4294967295
     2019-06-06T09:51:56.598072+00:00 r1n1 slurmstepd[108548]: Sent signal 15 to 11057.4294967295
     2019-06-06T09:51:56.599141+00:00 r1n1 slurmstepd[108548]: _oom_event_monitor: oom-kill event count: 1
     2019-06-06T09:51:56.641170+00:00 r1n1 slurmstepd[108548]: done with job

Secondary node (slurmd log):
     2019-06-06T09:51:54.256047+00:00 r1n2 slurmd[84916]: _run_prolog: run job script took usec=7
     2019-06-06T09:51:54.256432+00:00 r1n2 slurmd[84916]: _run_prolog: prolog with lock for job 11057 ran for 0 seconds
     2019-06-06T09:51:54.279763+00:00 r1n2 slurmstepd[84954]: task affinity plugin loaded with CPU mask 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000ffffffffff
     2019-06-06T09:51:54.280448+00:00 r1n2 slurmstepd[84954]: Munge credential signature plugin loaded
     2019-06-06T09:51:54.313852+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:54.314502+00:00 r1n2 slurmstepd[84954]: task/cgroup: /slurm/uid_2000/job_11057/step_extern: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
     2019-06-06T09:51:56.564758+00:00 r1n2 slurmstepd[84954]: Sent signal 18 to 11057.4294967295
     2019-06-06T09:51:56.608028+00:00 r1n2 slurmstepd[84954]: Sent signal 15 to 11057.4294967295
     2019-06-06T09:51:56.609259+00:00 r1n2 slurmstepd[84954]: _oom_event_monitor: oom-kill event count: 1
     2019-06-06T09:51:56.638334+00:00 r1n2 slurmstepd[84954]: done with job

Thank you very much again.

--
  Andrés Marín Díaz

  Servicio de Infraestructura e Innovación
  Universidad Politécnica de Madrid

  Centro de Supercomputación y Visualización de Madrid (CeSViMa)
  Campus de Montegancedo. 28223, Pozuelo de Alarcón, Madrid (ES)
  amarin at cesvima.upm.es | tel 910679676

  www.cesvima.upm.es | www.twitter.com/cesvima | www.fb.com/cesvima
