[slurm-users] Multinode MPI job

Frava fravadona at gmail.com
Thu Mar 28 16:31:45 UTC 2019


Well, does it also crash when you run it on two nodes in the normal way
(not as a heterogeneous job)?

#!/bin/bash
#SBATCH --job-name=myQE_2Nx2MPI
#SBATCH --output=big-mem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=16g
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
#
srun pw.x -i mos2.rlx.in
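
If the plain two-node run shows the same "running on 1 processors" symptom, it may be
that srun is not handing the MPI library the PMI environment it expects. A quick check
(only a sketch; whether pmi2 or pmix applies depends on how your Open MPI was built,
which I am assuming here):

# List the MPI/PMI plugins this Slurm installation provides
srun --mpi=list

# If pmi2 (or pmix) is listed and matches the Open MPI build,
# try launching with it explicitly
srun --mpi=pmi2 pw.x -i mos2.rlx.in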



On Thu, Mar 28, 2019 at 16:57, Mahmood Naderan <mahmood.nt at gmail.com>
wrote:

> BTW, when I manually run on a node, e.g. compute-0-2, I get this output
>
>
> ]$ mpirun -np 4 pw.x -i mos2.rlx.in
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 11:40:36
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     4 processors
>
>      MPI processes distributed on     1 nodes
>      R & G space division:  proc/nbgrp/npool/nimage =       4
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>
>      Subspace diagonalization in iterative solution of the eigenvalue
> problem:
>      a serial algorithm will be used
>
>      Found symmetry operation: I + (  0.0000  0.1667  0.0000)
> ...
> ...
> ...
>
>
> Regards,
> Mahmood
>
>
>
>
> On Thu, Mar 28, 2019 at 8:23 PM Mahmood Naderan <mahmood.nt at gmail.com>
> wrote:
>
>> The runs are not consistent. I have manually tested "mpirun -np 4 pw.x -i
>> mos2.rlx.in" on the compute-0-2 and rocks7 nodes and it works fine.
>> However, with "srun --pack-group=0 --ntasks=2 : --pack-group=1
>> --ntasks=4 pw.x -i mos2.rlx.in" in the script, I see errors in the output
>> file which cause the job to abort after about 60 seconds.
>>
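>> To make the interleaved output easier to attribute, one option (only a
>> sketch, keeping the same pack-group layout; the file-name pattern is my
>> own choice) is to give each task its own output file via srun's %N
>> (node name) and %t (task id) filename patterns:
>>
>> srun --pack-group=0 --ntasks=2 --output=pw_%N_task%t.out : \
>>      --pack-group=1 --ntasks=4 --output=pw_%N_task%t.out pw.x -i mos2.rlx.in
>>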
>> The errors are about not finding some files. They seem bizarre, since the
>> input file uses absolute paths for the intermediate files and those files
>> do exist.
>>
>>
>> compute-0-2
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
>> COMMAND
>>  3387 ghatee    20   0 1930488 129684   8336 R 100.0  0.2   0:09.71 pw.x
>>  3388 ghatee    20   0 1930476 129700   8336 R  99.7  0.2   0:09.68 pw.x
>>
>>
>>
>> rocks7
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+
>> COMMAND
>>  5592 ghatee    20   0 1930568 127764   8336 R 100.0  0.2   0:17.29 pw.x
>>   549 ghatee    20   0  116844   3652   1804 S   0.0  0.0   0:00.14 bash
>>
>>
>>
>> As you can see, the 2 tasks on compute-0-2 are fine, but there should be 4
>> tasks on rocks7 and only one pw.x is running there.
>> The input file contains
>>     outdir        = "/home/ghatee/job/2h-unitcell" ,
>>     pseudo_dir    = "/home/ghatee/q-e-qe-5.4/pseudo/" ,
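>>
>> Since the errors complain about unreadable files, one quick sanity check
>> (a sketch reusing the pseudo_dir path above) is to list that directory
>> from a task in each pack group and confirm it is visible on both nodes:
>>
>> srun --pack-group=0 --ntasks=1 ls -l /home/ghatee/q-e-qe-5.4/pseudo/ : \
>>      --pack-group=1 --ntasks=1 ls -l /home/ghatee/q-e-qe-5.4/pseudo/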
>>
>>
>> The output file says
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>      Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>      Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>      Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>      Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>>      This program is part of the open-source Quantum ESPRESSO suite
>>      for quantum simulation of materials; please cite
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>>           URL http://www.quantum-espresso.org",
>>      in publications or presentations arising from this work. More
>> details at
>>      http://www.quantum-espresso.org/quote
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>
>>      Parallel version (MPI), running on     1 processors
>>
>>      MPI processes distributed on     1 nodes
>>      Reading input from mos2.rlx.in
>>      Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card &CELL ignored
>> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
>> Warning: card / ignored
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>
>>      Current dimensions of program PWSCF are:
>>      Max number of different atomic species (ntypx) = 10
>>      Max number of k-points (npk) =  40000
>>      Max angular momentum in pseudopotentials (lmaxx) =  3
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> ERROR(FoX)
>> Cannot open file
>> ERROR(FoX)
>> Cannot open file
>>
>>
>>  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>      Error in routine read_ncpp (2):
>>      pseudo file is empty or wrong
>>
>>  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>
>>      stopping ...
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>> ...
>> ...
>> ...
>>
>>
>>
>>
>>
>> Counting the 6 "Parallel version (MPI), running on     1
>> processors" lines, it seems the tasks start as I specified in the
>> Slurm script. However, I suspect that the program is NOT running as one
>> multi-process MPI job: it looks like 6 instances of a serial run, and
>> there may be races between them during the run.
>> Any thoughts?
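>>
>> One way to separate the Slurm side from the MPI side (a sketch keeping the
>> same pack-group layout; note the SLURM_* variables may be reported per
>> pack group) would be to launch a trivial command with the same layout
>> first and check the task placement:
>>
>> srun --pack-group=0 --ntasks=2 bash -c 'echo "task $SLURM_PROCID on $(hostname)"' : \
>>      --pack-group=1 --ntasks=4 bash -c 'echo "task $SLURM_PROCID on $(hostname)"'
>>
>> If the six tasks land on the expected nodes, the launch layout is fine and
>> the "running on 1 processors" lines point at pw.x's MPI library not
>> picking up the environment that srun provides.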
>>
>> Regards,
>> Mahmood
>>
>>
>>
>>
>> On Thu, Mar 28, 2019 at 3:59 PM Frava <fravadona at gmail.com> wrote:
>>
>>> I didn't receive the last mail from Mahmood, but Marcus is right:
>>> Mahmood's heterogeneous job submission seems to be working now.
>>> Well, separating each pack in the srun command and requesting the
>>> correct number of tasks for each pack is the way I figured
>>> heterogeneous jobs worked with SLURM v18.08.0 (I didn't test it with
>>> more recent SLURM versions).
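>>>
>>> For the batch-script form, my understanding (untested here; the separator
>>> directive is what I recall from the 18.08 documentation, and the
>>> partition/account values are just reused from the earlier script) is that
>>> the packs are separated with a "#SBATCH packjob" line, roughly:
>>>
>>> #!/bin/bash
>>> #SBATCH --job-name=myQE_het
>>> #SBATCH --ntasks=2 --partition=QUARTZ --account=z5
>>> #SBATCH packjob
>>> #SBATCH --ntasks=4 --partition=QUARTZ --account=z5
>>>
>>> srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in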
>>>
>>>