[slurm-users] Multinode MPI job

Mahmood Naderan mahmood.nt at gmail.com
Thu Mar 28 15:53:37 UTC 2019


The run is not consistent. I have manually tested "mpirun -np 4 pw.x -i
mos2.rlx.in" on the compute-0-2 and rocks7 nodes and it works fine.
However, with the script line "srun --pack-group=0 --ntasks=2 :
--pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in" I see errors in the output
file, and the job aborts after about 60 seconds.

The errors are about files that cannot be found, which seems bizarre: the
input file uses absolute paths for the intermediate files, and the files do
exist.
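
For context, the srun line above sits in a heterogeneous batch script along
the lines of the following sketch (only the srun line is the exact one I
use; the #SBATCH resource lines here are illustrative placeholders, not the
exact values from my cluster):

    #!/bin/bash
    #SBATCH --ntasks=2 --nodelist=compute-0-2   # pack group 0 (placeholder)
    #SBATCH packjob
    #SBATCH --ntasks=4 --nodelist=rocks7        # pack group 1 (placeholder)

    # One srun launches both components; the ":" separates the pack groups.
    srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in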


compute-0-2
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3387 ghatee    20   0 1930488 129684   8336 R 100.0  0.2   0:09.71 pw.x
 3388 ghatee    20   0 1930476 129700   8336 R  99.7  0.2   0:09.68 pw.x



rocks7
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 5592 ghatee    20   0 1930568 127764   8336 R 100.0  0.2   0:17.29 pw.x
  549 ghatee    20   0  116844   3652   1804 S   0.0  0.0   0:00.14 bash



As you can see, the 2 tasks on compute-0-2 are running fine, but there
should be 4 tasks on rocks7, and only one pw.x process is there.
The input file contains
    outdir        = "/home/ghatee/job/2h-unitcell" ,
    pseudo_dir    = "/home/ghatee/q-e-qe-5.4/pseudo/" ,


The output file says

     Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3

     Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3

     Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3

     Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58

     This program is part of the open-source Quantum ESPRESSO suite
     for quantum simulation of materials; please cite
         "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
         "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
          URL http://www.quantum-espresso.org",
     in publications or presentations arising from this work. More details at
     http://www.quantum-espresso.org/quote

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes

     Parallel version (MPI), running on     1 processors

     MPI processes distributed on     1 nodes
     Reading input from mos2.rlx.in
     Reading input from mos2.rlx.in
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card &CELL ignored
Warning: card     CELL_DYNAMICS  = "BFGS" ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored
Warning: card     PRESS_CONV_THR =  5.00000E-01 ignored
Warning: card / ignored

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3

     Current dimensions of program PWSCF are:
     Max number of different atomic species (ntypx) = 10
     Max number of k-points (npk) =  40000
     Max angular momentum in pseudopotentials (lmaxx) =  3
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
               file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)  4S renormalized
ERROR(FoX)
Cannot open file
ERROR(FoX)
Cannot open file

 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine read_ncpp (2):
     pseudo file is empty or wrong
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

     stopping ...
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
...
...
...
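
Since the abort complains about the pseudopotential file, one quick sanity
check (just a sketch, reusing the exact paths quoted from the input file
above) is to confirm that every pack group sees the same, non-empty file:

    # Run one task per pack group and list the pseudo file from each node.
    srun --pack-group=0 --ntasks=1 : --pack-group=1 --ntasks=1 \
         ls -l /home/ghatee/q-e-qe-5.4/pseudo/Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF

If the file shows up with the same non-zero size from both groups, the
"pseudo file is empty or wrong" error is probably not a filesystem problem.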





Counting the six "Parallel version (MPI), running on     1 processors"
lines, the six tasks do seem to start as I specified in the Slurm script.
However, I suspect that this is NOT a multi-core MPI job: it looks like 6
independent serial runs, and there may be races between them during the run.
Any thoughts?
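
One way to test this suspicion (a rough sketch; OMPI_COMM_WORLD_SIZE
assumes Open MPI and may legitimately be unset when launching through
srun's PMI rather than through mpirun) is to rerun the same heterogeneous
step with a shell command in place of pw.x:

    # Each task reports its host, its Slurm task id, and what the MPI
    # runtime thinks the world size is (if that variable is set at all).
    srun --pack-group=0 --ntasks=2 : --pack-group=1 --ntasks=4 \
         bash -c 'echo "$(hostname) rank=$SLURM_PROCID size=${OMPI_COMM_WORLD_SIZE:-unset}"'

If pw.x keeps reporting "running on     1 processors", its MPI library is
probably not talking to Slurm's process manager; "srun --mpi=list" shows
which PMI plugins this Slurm build supports, and passing e.g. --mpi=pmi2
explicitly may be needed.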

Regards,
Mahmood




On Thu, Mar 28, 2019 at 3:59 PM Frava <fravadona at gmail.com> wrote:

> I didn't receive the last mail from Mahmood, but Marcus is right: Mahmood's
> heterogeneous job submission seems to be working now.
> Separating each pack in the srun command and asking for the correct number
> of tasks to be launched for each pack is how I understood heterogeneous
> jobs to work with SLURM v18.08.0 (I didn't test it with more recent SLURM
> versions).
>
>