[slurm-users] Unexpected MPI process distribution with the --exclusive flag
Brian Andrus
toomuchit at gmail.com
Tue Jul 30 15:03:59 UTC 2019
I think this may be more about how you are calling mpirun and the mapping
of processes.
With the "--exclusive" option, the processes are given access to all the
cores on each box, so mpirun has a choice of where to place them. IIRC,
the default is to pack them by slot, so it fills one node, then moves to
the next. Whereas you want to map by node (one process per node, cycling
through the nodes).
From the man for mpirun (openmpi):
*--map-by <foo>*
    Map to the specified object, defaults to socket. Supported options
include slot, hwthread, core, L1cache, L2cache, L3cache, socket,
numa, board, node, sequential, distance, and ppr. Any object can
include modifiers by adding a : and any combination of PE=n (bind n
processing elements to each proc), SPAN (load balance the processes
across the allocation), OVERSUBSCRIBE (allow more processes on a
node than processing elements), and NOOVERSUBSCRIBE. This includes
PPR, where the pattern would be terminated by another colon to
separate it from the modifiers.
so adding "--map-by node" would give you what you are looking for.
Of course, this syntax is for Openmpi's mpirun command, so YMMV
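Putting the pieces together, a batch script along these lines might do it
(a sketch only -- the application name "./my_mpi_app" is a placeholder,
and the exact flags depend on your Open MPI version):

```shell
#!/bin/bash
#SBATCH -n 980
#SBATCH --ntasks-per-node=16
#SBATCH --exclusive

# With --exclusive, all cores on each node are allocated, so tell mpirun
# to place ranks round-robin across nodes instead of packing by slot.
mpirun --map-by node ./my_mpi_app
```

Note that "--map-by node" balances ranks round-robin, so the counts per
node may not land exactly at 16; if you need exactly 16 ranks per node,
the ppr form from the man page excerpt above ("--map-by ppr:16:node")
may be closer to what you want.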
Brian Andrus
On 7/30/2019 5:14 AM, CB wrote:
> Hi Everyone,
>
> I've recently discovered that when an MPI job is submitted with the
> --exclusive flag, Slurm fills up each node even if
> the --ntasks-per-node flag is used to set how many MPI processes are
> scheduled on each node. Without the --exclusive flag, Slurm works
> fine, as expected.
>
> Our system is running with Slurm 17.11.7.
>
> The following options work as expected: each node gets 16 MPI processes
> until all 980 MPI processes are scheduled, across a total of 62 compute
> nodes. Each of the first 61 nodes runs 16 MPI processes and the last one
> runs 4, which is 980 MPI processes in total.
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
>
> However, if the --exclusive option is added, Slurm fills up each node
> with 28 MPI processes (each compute node has 28 cores). Interestingly,
> Slurm still allocates 62 compute nodes, although only 35 of them are
> actually used to run the 980 MPI processes.
>
> #SBATCH -n 980
> #SBATCH --ntasks-per-node=16
> #SBATCH --exclusive
>
> Has anyone seen this behavior?
>
> Thanks,
> - Chansup