[slurm-users] Heterogeneous job one MPI_COMM_WORLD
steve.mehlberg at atos.net
Wed Oct 10 08:11:10 MDT 2018
I got this same error when testing on older updates (17.11?). Try the Slurm-18.08 branch or master. I'm testing 18.08 now and get this:
[slurm at trek6 mpihello]$ srun -phyper -n3 --mpi=pmi2 --pack-group=0-2 ./mpihello-ompi2-rhel7 | sort
srun: job 643 queued and waiting for resources
srun: job 643 has been allocated resources
Hello world, I am 0 of 9 - running on trek7
Hello world, I am 1 of 9 - running on trek7
Hello world, I am 2 of 9 - running on trek7
Hello world, I am 3 of 9 - running on trek8
Hello world, I am 4 of 9 - running on trek8
Hello world, I am 5 of 9 - running on trek8
Hello world, I am 6 of 9 - running on trek9
Hello world, I am 7 of 9 - running on trek9
Hello world, I am 8 of 9 - running on trek9
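
For reference, a minimal MPI "hello world" consistent with the output above might look like the following. This is a hypothetical sketch; the actual source of mpihello-ompi2-rhel7 is not shown anywhere in this thread.

```c
/* Hypothetical sketch of an MPI hello-world producing output like the
 * lines above; the real test program's source is not in this thread. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this task's ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total tasks across all pack groups */
    MPI_Get_processor_name(host, &len);    /* node name, e.g. trek7 */

    printf("Hello world, I am %d of %d - running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}
```

If the pack groups are combined into one MPI_COMM_WORLD as intended, MPI_Comm_size reports the total task count across all components (9 above), with non-overlapping ranks.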
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Pritchard Jr., Howard
Sent: Wednesday, October 10, 2018 7:58 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD
We hit some problems at LANL trying to use this Slurm feature.
At the time, I think SchedMD said there would need to be fixes to the Slurm PMI2 library to get this to work.
What version of Slurm are you using?
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203
Los Alamos National Laboratory
On 10/9/18, 8:50 PM, "slurm-users on behalf of Gilles Gouaillardet"
<slurm-users-bounces at lists.schedmd.com on behalf of gilles at rist.or.jp> wrote:
>This looks like a Slurm issue and Open MPI is (currently) out of the
>picture.
>What if you
>srun --pack-group=0,1 hostname
>Do you get a similar error?
>On 10/10/2018 3:07 AM, Christopher Benjamin Coffey wrote:
>> I have a user trying to setup a heterogeneous job with one
>>MPI_COMM_WORLD with the following:
>> #SBATCH --job-name=hetero
>> #SBATCH --output=/scratch/cbc/hetero.txt
>> #SBATCH --time=2:00
>> #SBATCH --workdir=/scratch/cbc
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=2g --ntasks=1 -C sb
>> #SBATCH packjob
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=1g --ntasks=1 -C sl
>> #SBATCH --mail-type=START,END
>> module load openmpi/3.1.2-gcc-6.2.0
>> srun --pack-group=0,1 ~/hellompi
>> Yet, we get an error: "srun: fatal: Job steps that span multiple
>>components of a heterogeneous job are not currently supported". But
>>the docs seem to indicate it should work?
>> IMPORTANT: The ability to execute a single application across more
>>than one job allocation does not work with all MPI implementations or
>>Slurm MPI plugins. Slurm's ability to execute such an application can
>>be disabled on the entire cluster by adding "disable_hetero_steps" to
>>Slurm's SchedulerParameters configuration parameter.
>> By default, the applications launched by a single execution of the
>>srun command (even for different components of the heterogeneous job)
>>are combined into one MPI_COMM_WORLD with non-overlapping task IDs.
>> Does this not work with openmpi? If not, which mpi/slurm config will
>>work? We have slurm.conf MpiDefault=pmi2 currently. I've tried a
>>modern openmpi, and also mpich, and mvapich2.
>> Any help would be appreciated, thanks!
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
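
For what it's worth, the disable_hetero_steps knob quoted above would live in slurm.conf alongside the MpiDefault setting Christopher mentions. A sketch of the relevant excerpt (site-specific settings will differ):

```
# slurm.conf excerpt (sketch; other site-specific settings omitted)
MpiDefault=pmi2
# Uncommenting the line below would disable job steps that span
# multiple components of a heterogeneous job, cluster-wide:
#SchedulerParameters=disable_hetero_steps
```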