[slurm-users] Slurm + IntelMPI

Theodore Knab tjk at annapolislinux.org
Sat Apr 1 15:43:42 UTC 2023


Mr Hermann Schwärzler,

We use srun and Spack for running our Fortran-based hydro models.

We discovered that mpirun would run the model, but it was really slow.

So we used srun instead. To get srun to work with Intel MPI, we had to tell
Spack via compilers.yaml where the relevant libraries are located (the Intel
compiler runtime and Slurm's PMI library). After that, we installed the
intel-mpi packages and the other dependencies.

The relevant part of our compilers.yaml (lines 13-18):
    environment:
      prepend_path:
        LD_LIBRARY_PATH: '/opt/intel/oneapi/compiler/2022.1.0/linux/compiler/lib/intel64_lin'
      set:
        I_MPI_PMI_LIBRARY: '/opt/slurm/lib/libpmi.so'

Here is a reference: 
https://github.com/NOAA-EMC/spack-stack/blob/develop/configs/sites/aws-pcluster/compilers.yaml
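
For illustration, here is a minimal sketch of the kind of batch script this
lets us run, launching through srun rather than mpirun (the task counts, the
binary name and the commented spack spec are placeholders, not our exact
setup):

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=6
    #SBATCH --ntasks-per-node=6

    # Point Intel MPI at Slurm's PMI library so srun can manage the ranks
    # (same path as in the compilers.yaml snippet above).
    export I_MPI_PMI_LIBRARY=/opt/slurm/lib/libpmi.so

    # Make the Spack-built Intel MPI available (placeholder spec).
    # spack load intel-oneapi-mpi

    # srun inherits the task geometry from the #SBATCH directives above,
    # so there is no single-task bootstrap step like the one mpirun uses.
    srun ./hydro_model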


On 21/03/23 17:58 +0100, Hermann Schwärzler wrote:
> Hi everybody,
> 
> in our new cluster we have configured Slurm with
> 
> SelectType=select/cons_tres
> SelectTypeParameters=CR_Core_Memory
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/affinity,task/cgroup
> 
> which I think is quite a usual setup.
> 
> After installing Intel MPI (using Spack v0.19) we saw that there is a
> serious problem with task distribution when we use its mpirun utility - see
> this output of top when we submit a job of an mpi_hello program with one
> node and 6 tasks:
> 
> [...]
>  P COMMAND
> 33  `- slurmstepd: [98749.extern]
> 34      `- sleep 100000000
> 32  `- slurmstepd: [98749.batch]
> 32      `- /bin/bash /var/spool/slurm/slurmd/job98749/slurm_script
> 33          `- /bin/sh /path/to/mpirun bash -c ./mpi_hello_world; sleep 30
>  2              `- mpiexec.hydra bash -c ./mpi_hello_world; sleep 30
>  1                  `- /usr/slurm/bin/srun -N 1 -n 1 --ntasks-per-node 1 --nodelist n054 --input none /path/to/hydra_bstrap_proxy ...
>  0                      `- /usr/slurm/bin/srun -N 1 -n 1 --ntasks-per-node 1 --nodelist n054 --input none /path/to/hydra_bstrap_proxy ...
> 32  `- slurmstepd: [98749.0]
>  0      `- /path/to/hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
>  0          `- bash -c ./mpi_hello_world; sleep 30
>  0              `- sleep 30
> 
> As you can see: mpirun starts mpiexec.hydra, which starts srun (with options
> "-N 1 -n 1") to start hydra_bstrap_proxy. This of course starts a new job
> step in which hydra_bstrap_proxy runs hydra_pmi_proxy to finally start our
> six instances of the desired program.
> 
> The problem in our setup is that this srun explicitly asks for only one
> task. Its job step is therefore constrained to one task (and one CPU), and
> so *all six tasks run on one single CPU* (see the "P" column of top). :-(
> 
> I found documentation online from others who seem to have had similar
> problems and who recommend that their users use srun instead of mpirun with
> Intel MPI.
> 
> Is this really the only "solution" to this problem?
> Or are there other ones?
> 
> Regards,
> Hermann
> 

-- 
Theodore Knab
Annapolis Linux Users Group
Near Annapolis, Maryland, United States of America
--------------
Life is like riding a unicycle. To keep your balance you must keep moving.


