[slurm-users] Multinode MPI job

Mahmood Naderan mahmood.nt at gmail.com
Thu Mar 28 15:55:19 UTC 2019


BTW, when I run it manually on a node, e.g. compute-0-2, I get this output:


]$ mpirun -np 4 pw.x -i mos2.rlx.in

     Program PWSCF v.6.2 starts on 28Mar2019 at 11:40:36

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details
at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     4 processors

     MPI processes distributed on     1 nodes
     R & G space division:  proc/nbgrp/npool/nimage =       4
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S
renormalized

     Subspace diagonalization in iterative solution of the eigenvalue
problem:
     a serial algorithm will be used

     Found symmetry operation: I + (  0.0000  0.1667  0.0000)
...
...
...


Regards,
Mahmood




On Thu, Mar 28, 2019 at 8:23 PM Mahmood Naderan <mahmood.nt at gmail.com>
wrote:

> The run is not consistent. I have manually tested "mpirun -np 4 pw.x -i
> mos2.rlx.in" on the compute-0-2 and rocks7 nodes and it works fine.
> However, with the script line "srun --pack-group=0 --ntasks=2 : --pack-group=1
> --ntasks=4 pw.x -i mos2.rlx.in" I see some errors in the output file,
> and the job aborts after about 60 seconds.
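>
> For completeness, the batch script driving that srun line looks roughly like
> the sketch below (not the exact file: the srun line, node names and task
> counts come from the real job, while selecting the nodes with --nodelist is
> only an illustration; "packjob" is the component separator understood by
> Slurm 17.11/18.08):
>
>     #!/bin/bash
>     # first pack group: 2 tasks on compute-0-2
>     #SBATCH --ntasks=2 --nodelist=compute-0-2
>     #SBATCH packjob
>     # second pack group: 4 tasks on rocks7
>     #SBATCH --ntasks=4 --nodelist=rocks7
>
>     srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in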
>
> The errors are about some files not being found. They seem bizarre, because
> the input file uses absolute paths for the intermediate files and those
> files do exist.
>
>
> compute-0-2
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  3387 ghatee    20   0 1930488 129684   8336 R 100.0  0.2   0:09.71 pw.x
>  3388 ghatee    20   0 1930476 129700   8336 R  99.7  0.2   0:09.68 pw.x
>
>
>
> rocks7
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  5592 ghatee    20   0 1930568 127764   8336 R 100.0  0.2   0:17.29 pw.x
>   549 ghatee    20   0  116844   3652   1804 S   0.0  0.0   0:00.14 bash
>
>
>
> As you can see, the 2 tasks on compute-0-2 are running fine, but rocks7
> shows only one pw.x process even though there should be 4 tasks there.
> Input file contains
>     outdir        = "/home/ghatee/job/2h-unitcell" ,
>     pseudo_dir    = "/home/ghatee/q-e-qe-5.4/pseudo/" ,
>
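> Since the errors complain about missing files, one quick check (just a
> sketch, run from inside the heterogeneous allocation) is to confirm that
> both directories are visible from each pack group:
>
>     # list the output and pseudopotential directories from each pack
>     srun --pack-group=0 ls -ld /home/ghatee/job/2h-unitcell /home/ghatee/q-e-qe-5.4/pseudo
>     srun --pack-group=1 ls -ld /home/ghatee/job/2h-unitcell /home/ghatee/q-e-qe-5.4/pseudo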
>
> The output file says
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>
>      This program is part of the open-source Quantum ESPRESSO suite
>      for quantum simulation of materials; please cite
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>          "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>           URL http://www.quantum-espresso.org",
>      in publications or presentations arising from this work. More details
> at
>      http://www.quantum-espresso.org/quote
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>
>      Parallel version (MPI), running on     1 processors
>
>      MPI processes distributed on     1 nodes
>      Reading input from mos2.rlx.in
>      Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card &CELL ignored
> Warning: card     CELL_DYNAMICS  = "BFGS" ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
> Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
> Warning: card / ignored
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>
>      Current dimensions of program PWSCF are:
>      Max number of different atomic species (ntypx) = 10
>      Max number of k-points (npk) =  40000
>      Max angular momentum in pseudopotentials (lmaxx) =  3
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>                file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
> ERROR(FoX)
> Cannot open file
> ERROR(FoX)
> Cannot open file
>
>
>  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>      Error in routine read_ncpp (2):
>      pseudo file is empty or wrong
>
>  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>
>      stopping ...
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 1.
> ...
> ...
> ...
>
>
>
>
>
> Counting the lines, there are 6 occurrences of "Parallel version (MPI),
> running on     1 processors", so the number of processes matches what I
> specified in the slurm script. However, I suspect this is NOT a
> multi-process MPI job: it looks like 6 independent serial runs, and there
> may be races between them.
> Any thoughts?
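>
> As a next step, I plan to check whether srun and the MPI library are wired
> together at all (a sketch; which PMI types are available depends on how
> Slurm and the MPI library were built on this cluster):
>
>     # list the PMI plugin types this Slurm installation supports
>     srun --mpi=list
>     # try launching one pack group with an explicit PMI type, e.g. pmi2
>     srun --mpi=pmi2 --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in
>
> If every rank still reports "running on     1 processors", the processes are
> not being wired into a single MPI job, which would be consistent with the six
> independent serial runs above.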
>
> Regards,
> Mahmood
>
>
>
>
> On Thu, Mar 28, 2019 at 3:59 PM Frava <fravadona at gmail.com> wrote:
>
>> I didn't receive the last mail from Mahmood, but Marcus is right:
>> Mahmood's heterogeneous job submission seems to be working now.
>> Separating the packs in the srun command and requesting the correct number
>> of tasks for each pack is how I found heterogeneous jobs to work with
>> SLURM v18.08.0 (I haven't tested more recent SLURM versions).
>>
>>