[slurm-users] Intel MPI issue with slurm sbatch
Joe Teumer
joe.teumer at gmail.com
Wed Aug 17 14:21:20 UTC 2022
Fixed with one of the Hydra environment variables (intel.com):
<https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference/hydra-environment-variables.html>

I_MPI_HYDRA_BOOTSTRAP=ssh

Hydra auto-detects a Slurm allocation and bootstraps its proxies through
srun (note the "--launcher slurm" in the sbatch trace quoted below);
forcing the ssh bootstrap makes the sbatch job launch the same way as the
manual SSH run.
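For reference, here is a minimal sbatch script applying the fix. This is
only a sketch built around the command from the original message below;
the #SBATCH resource lines are illustrative, and it assumes password-less
ssh to the compute nodes.

  #!/bin/bash
  #SBATCH --nodes=1
  #SBATCH --exclusive

  # Force Hydra to bootstrap over ssh instead of srun, so the thread
  # affinity behavior matches a manual SSH run.
  export I_MPI_HYDRA_BOOTSTRAP=ssh

  numactl -C 0-63,128-191 -m 0 mpirun -verbose \
      -genv I_MPI_DEBUG=4 -genv KMP_AFFINITY=verbose,granularity=fine,compact \
      -np 64 -ppn 64 ./mpiprogram -in in.program -log program \
      -pk intel 0 omp 2 -sf intel -screen none -v d 1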
On Tue, Aug 16, 2022 at 11:09 AM Joe Teumer <joe.teumer at gmail.com> wrote:
> Hello!
>
> Is there a way to turn off Slurm's MPI hooks?
> When a job is submitted via sbatch and executes Intel MPI, the thread
> affinity settings are incorrect.
> However, running the same MPI command manually over SSH works, and all
> bindings are correct.
>
> We are looking to run our MPI jobs via Slurm sbatch and get the same
> behavior as when running them manually over SSH.
>
> slurmd -V
> slurm 22.05.3
>
> RUNNING OMP_NUM_THREADS=, cmd=numactl -C 0-63,128-191 -m 0 mpirun -verbose
>   -genv I_MPI_DEBUG=4 -genv KMP_AFFINITY=verbose,granularity=fine,compact
>   -np 64 -ppn 64 ./mpiprogram -in in.program -log program
>   -pk intel 0 omp 2 -sf intel -screen none -v d 1
>
> which mpirun
> /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin/mpirun
>
> slurm sbatch:
>
> [mpiexec at node] Launch arguments: /usr/local/bin/srun -N 1 -n 1
>   --ntasks-per-node 1 --nodelist node --input none
>   /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin//hydra_bstrap_proxy
>   --upstream-host node --upstream-port 45427 --pgid 0 --launcher slurm
>   --launcher-number 1 --base-path /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin/
>   --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug
>   /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin//hydra_pmi_proxy
>   --usize -1 --auto-cleanup 1 --abort-signal 9
>
> SSH manual run:
>
> [mpiexec at node] Launch arguments:
>   /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin//hydra_bstrap_proxy
>   --upstream-host node --upstream-port 35747 --pgid 0 --launcher ssh
>   --launcher-number 0 --base-path /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin/
>   --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug
>   --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7
>   /opt/intel/psxe_runtime_2019.6.324/linux/mpi/intel64/bin//hydra_pmi_proxy
>   --usize -1 --auto-cleanup 1 --abort-signal 9
>
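A quick way to verify the override takes effect under sbatch (a sketch;
hostname stands in for the real binary): the Hydra trace should now show
"--launcher ssh" instead of "--launcher slurm".

  sbatch -N 1 --wrap 'I_MPI_HYDRA_BOOTSTRAP=ssh mpirun -verbose -np 2 hostname'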