[slurm-users] Unexpected MPI process distribution with the --exclusive flag

CB cbalways at gmail.com
Wed Jul 31 13:11:56 UTC 2019

Thanks for the replies.

I didn't specify earlier, but we're using Intel MPI, and setting the following
environment variable, I_MPI_JOB_RESPECT_PROCESS_PLACEMENT, fixed my issue.

#SBATCH --ntasks=980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

mpirun -np $SLURM_NTASKS -perhost $SLURM_NTASKS_PER_NODE /path/to/MPI/app
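
For reference, the variable needs to be set before the mpirun line; a minimal
sketch, using "off" as one of the disable values documented for Intel MPI
(disabling it makes mpirun honor -perhost instead of the scheduler-provided
placement):

export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
mpirun -np $SLURM_NTASKS -perhost $SLURM_NTASKS_PER_NODE /path/to/MPI/app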


- Chansup

On Wed, Jul 31, 2019 at 2:01 AM Daniel Letai <dani at letai.org.il> wrote:

> On 7/30/19 6:03 PM, Brian Andrus wrote:
> I think this may be more about how you are calling mpirun and the mapping of
> processes.
> With the "--exclusive" option, the processes are given access to all the
> cores on each box, so mpirun has a choice. IIRC, the default is to map them
> by slot, i.e. fill one node, then move to the next, whereas you want to map
> by node (one process per node, cycling across nodes).
> From the man for mpirun (openmpi):
> *--map-by <foo>* Map to the specified object, defaults to *socket*.
> Supported options include slot, hwthread, core, L1cache, L2cache, L3cache,
> socket, numa, board, node, sequential, distance, and ppr. Any object can
> include modifiers by adding a : and any combination of PE=n (bind n
> processing elements to each proc), SPAN (load balance the processes across
> the allocation), OVERSUBSCRIBE (allow more processes on a node than
> processing elements), and NOOVERSUBSCRIBE. This includes PPR, where the
> pattern would be terminated by another colon to separate it from the
> modifiers.
> so adding "--map-by node" would give you what you are looking for.
> Of course, this syntax is for Open MPI's mpirun command, so YMMV.
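> For example, keeping the task count from the original post (a sketch; the
> --map-by flag is the only new piece, the app path is a placeholder):
>
> mpirun -np $SLURM_NTASKS --map-by node /path/to/MPI/app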
> If using srun (as recommended) instead of invoking mpirun directly, you
> can still achieve the same functionality using exported environment
> variables as per the mpirun man page, like this:
> OMPI_MCA_rmaps_base_mapping_policy=node srun
> --export=ALL,OMPI_MCA_rmaps_base_mapping_policy ...
> in your sbatch script.
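> Putting that together, a minimal sketch of such a script (the values mirror
> the original post; note that srun propagates the full submission environment
> by default, so the explicit --export is optional):
>
> #!/bin/bash
> #SBATCH --ntasks=980
> #SBATCH --ntasks-per-node=16
> #SBATCH --exclusive
> export OMPI_MCA_rmaps_base_mapping_policy=node
> srun /path/to/MPI/app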
> Brian Andrus
> On 7/30/2019 5:14 AM, CB wrote:
> Hi Everyone,
> I've recently discovered that when an MPI job is submitted with the
> --exclusive flag, Slurm fills up each node even if the --ntasks-per-node
> flag is used to set how many MPI processes are scheduled on each node.
> Without the --exclusive flag, Slurm works as expected.
> Our system is running with Slurm 17.11.7.
> The following options work as expected: each node gets 16 MPI processes
> until all 980 MPI processes are scheduled, across a total of 62 compute
> nodes. Each of the first 61 nodes runs 16 MPI processes and the last one
> runs 4 (61 x 16 + 4 = 980 MPI processes in total).
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
> However, if the --exclusive option is added, Slurm fills up each node with
> 28 MPI processes (each compute node has 28 cores). Interestingly, Slurm
> still allocates 62 compute nodes, although only 35 of them (35 x 28 = 980)
> are actually used to run the 980 MPI processes.
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
> #SBATCH --exclusive
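> A quick way to check the actual placement from inside the allocation
> (prints a task count per node):
>
> srun hostname | sort | uniq -c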
> Has anyone seen this behavior?
> Thanks,
> - Chansup