[slurm-users] Multinode MPI job
Frava
fravadona at gmail.com
Thu Mar 28 16:31:45 UTC 2019
Well, does it also crash when you run it on two nodes in the normal way
(i.e., not as a heterogeneous job)?
#!/bin/bash
#SBATCH --job-name=myQE_2Nx2MPI
#SBATCH --output=big-mem
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --mem-per-cpu=16g
#SBATCH --partition=QUARTZ
#SBATCH --account=z5
#
srun pw.x -i mos2.rlx.in
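
For reference, submitting and checking that script would look something like
this (the script name "myqe.sh" is only an example, not from your setup):

$ sbatch myqe.sh          # submit the 2-node / 4-task job
$ squeue -u $USER         # check that the job really gets two nodes

Putting a plain "srun hostname" in the script before the pw.x line is also a
quick way to confirm that the 4 tasks get spread over the 2 nodes.
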
On Thu, Mar 28, 2019 at 4:57 PM Mahmood Naderan <mahmood.nt at gmail.com>
wrote:
> BTW, when I manually run on a node, e.g. compute-0-2, I get this output
>
>
> $ mpirun -np 4 pw.x -i mos2.rlx.in
>
> Program PWSCF v.6.2 starts on 28Mar2019 at 11:40:36
>
> This program is part of the open-source Quantum ESPRESSO suite
> for quantum simulation of materials; please cite
> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
> URL http://www.quantum-espresso.org",
> in publications or presentations arising from this work. More details
> at
> http://www.quantum-espresso.org/quote
>
> Parallel version (MPI), running on 4 processors
>
> MPI processes distributed on 1 nodes
> R & G space division: proc/nbgrp/npool/nimage = 4
> Reading input from mos2.rlx.in
> Warning: card &CELL ignored
> Warning: card CELL_DYNAMICS = "BFGS" ignored
> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
> Warning: card / ignored
>
> Current dimensions of program PWSCF are:
> Max number of different atomic species (ntypx) = 10
> Max number of k-points (npk) = 40000
> Max angular momentum in pseudopotentials (lmaxx) = 3
> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
> 4S renormalized
>
> Subspace diagonalization in iterative solution of the eigenvalue
> problem:
> a serial algorithm will be used
>
> Found symmetry operation: I + ( 0.0000 0.1667 0.0000)
> ...
> ...
> ...
>
>
> Regards,
> Mahmood
>
>
>
>
> On Thu, Mar 28, 2019 at 8:23 PM Mahmood Naderan <mahmood.nt at gmail.com>
> wrote:
>
>> The behavior is not consistent. I have manually tested "mpirun -np 4 pw.x -i
>> mos2.rlx.in" on the compute-0-2 and rocks7 nodes and it runs fine.
>> However, with the script "srun --pack-group=0 --ntasks=2 : --pack-group=1
>> --ntasks=4 pw.x -i mos2.rlx.in" I see some errors in the output file,
>> and the job aborts after about 60 seconds.
>>
>> The errors are about some files not being found. Although the input file
>> uses absolute paths for the intermediate files and the files do exist, the
>> errors seem bizarre.
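
(As a sanity check, and only as a sketch that reuses the same pack-group
syntax: it may be worth launching something trivial in both packs first, to
see how many tasks actually start and on which nodes:

srun --pack-group=0 --ntasks=2 hostname : --pack-group=1 --ntasks=4 hostname

If that does not print 2 + 4 host names across the two nodes, the problem is
in the allocation/launch itself rather than in pw.x.)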
>>
>>
>> compute-0-2
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3387 ghatee 20 0 1930488 129684 8336 R 100.0 0.2 0:09.71 pw.x
>> 3388 ghatee 20 0 1930476 129700 8336 R 99.7 0.2 0:09.68 pw.x
>>
>>
>>
>> rocks7
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 5592 ghatee 20 0 1930568 127764 8336 R 100.0 0.2 0:17.29 pw.x
>> 549 ghatee 20 0 116844 3652 1804 S 0.0 0.0 0:00.14 bash
>>
>>
>>
>> As you can see, the 2 tasks are fine on compute-0-2, but there should be 4
>> tasks on rocks7 and only one pw.x shows up there.
>> The input file contains
>> outdir = "/home/ghatee/job/2h-unitcell" ,
>> pseudo_dir = "/home/ghatee/q-e-qe-5.4/pseudo/" ,
>>
>>
>> The output file says
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 11:43:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Program PWSCF v.6.2 starts on 28Mar2019 at 20:13:58
>>
>> This program is part of the open-source Quantum ESPRESSO suite
>> for quantum simulation of materials; please cite
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
>> "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
>> URL http://www.quantum-espresso.org",
>> in publications or presentations arising from this work. More
>> details at
>> http://www.quantum-espresso.org/quote
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>>
>> Parallel version (MPI), running on 1 processors
>>
>> MPI processes distributed on 1 nodes
>> Reading input from mos2.rlx.in
>> Reading input from mos2.rlx.in
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card &CELL ignored
>> Warning: card CELL_DYNAMICS = "BFGS" ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>> Warning: card PRESS_CONV_THR = 5.00000E-01 ignored
>> Warning: card / ignored
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>>
>> Current dimensions of program PWSCF are:
>> Max number of different atomic species (ntypx) = 10
>> Max number of k-points (npk) = 40000
>> Max angular momentum in pseudopotentials (lmaxx) = 3
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> file Mo.revpbe-spn-rrkjus_psl.0.3.0.UPF: wavefunction(s)
>> 4S renormalized
>> ERROR(FoX)
>> Cannot open file
>> ERROR(FoX)
>> Cannot open file
>>
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>> Error in routine read_ncpp (2):
>> pseudo file is empty or wrong
>>
>> %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
>>
>> stopping ...
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 1.
>> ...
>> ...
>> ...
>>
>>
>>
>>
>>
>> Counting the six "Parallel version (MPI), running on 1
>> processors" lines, it seems that the right number of processes starts, as
>> specified in the Slurm script. However, I suspect that this is NOT one
>> parallel MPI job: it looks like 6 independent serial runs, and there may be
>> some races between them during the run.
>> Any thoughts?
>>
>> Regards,
>> Mahmood
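
(On the repeated "running on 1 processors" lines: this usually means each
task starts its own MPI_COMM_WORLD of size 1, i.e. the MPI library is not
picking up the PMI that srun provides. A rough sketch of what one could
check; the available plugin names depend on how Slurm and the MPI library
were built:

$ srun --mpi=list                        # list the PMI plugins this Slurm offers
$ srun --mpi=pmi2 pw.x -i mos2.rlx.in    # or --mpi=pmix, if the MPI build supports it

If srun cannot provide a PMI that matches the MPI library, running mpirun
inside the allocation is the usual fallback.)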
>>
>>
>>
>>
>> On Thu, Mar 28, 2019 at 3:59 PM Frava <fravadona at gmail.com> wrote:
>>
>>> I didn't receive the last mail from Mahmood but Marcus is right,
>>> Mahmood's heterogeneous job submission seems to be working now.
>>> Well, separating each pack in the srun command and requesting the correct
>>> number of tasks to be launched for each pack is how I got heterogeneous
>>> jobs to work with SLURM v18.08.0 (I haven't tested it with more recent
>>> SLURM versions).
>>>
>>>
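
For completeness, a minimal heterogeneous batch script in that 18.08 syntax
would look roughly like the sketch below (the partition, account and 2+4 task
split are taken from this thread; everything else is illustrative):

#!/bin/bash
#SBATCH --job-name=myQE_het
#SBATCH --partition=QUARTZ --account=z5
#SBATCH --ntasks=2 --mem-per-cpu=16g
#SBATCH packjob
#SBATCH --partition=QUARTZ --account=z5
#SBATCH --ntasks=4 --mem-per-cpu=16g
srun --pack-group=0 --ntasks=2 pw.x -i mos2.rlx.in : \
     --pack-group=1 --ntasks=4 pw.x -i mos2.rlx.in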