[slurm-users] Heterogeneous job one MPI_COMM_WORLD

Mehlberg, Steve steve.mehlberg at atos.net
Wed Oct 10 08:11:10 MDT 2018


I got this same error when testing on older releases (17.11?).  Try the slurm-18.08 branch or master.  I'm testing 18.08 now and get this:

[slurm at trek6 mpihello]$ srun -phyper -n3 --mpi=pmi2 --pack-group=0-2 ./mpihello-ompi2-rhel7 | sort
srun: job 643 queued and waiting for resources
srun: job 643 has been allocated resources
Hello world, I am 0 of 9 - running on trek7
Hello world, I am 1 of 9 - running on trek7
Hello world, I am 2 of 9 - running on trek7
Hello world, I am 3 of 9 - running on trek8
Hello world, I am 4 of 9 - running on trek8
Hello world, I am 5 of 9 - running on trek8
Hello world, I am 6 of 9 - running on trek9
Hello world, I am 7 of 9 - running on trek9
Hello world, I am 8 of 9 - running on trek9
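
For reference, the test program is just a plain MPI hello world; a minimal sketch along these lines (not necessarily the exact source of mpihello-ompi2-rhel7) prints that kind of output:

#include <mpi.h>
#include <stdio.h>

/* Minimal MPI hello world: each rank reports its rank, the size of
 * MPI_COMM_WORLD, and the node it is running on. */
int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello world, I am %d of %d - running on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

Built with something like "mpicc -o mpihello mpihello.c" against the Open MPI install being tested.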

-Steve

-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Pritchard Jr., Howard
Sent: Wednesday, October 10, 2018 7:58 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Heterogeneous job one MPI_COMM_WORLD

Hi Christopher,

We hit some problems at LANL trying to use this SLURM feature.
At the time, I think SchedMD said there would need to be fixes to the SLURM PMI2 library to get this to work.

What version of SLURM are you using?

Howard


--
Howard Pritchard

B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203

Los Alamos National Laboratory





On 10/9/18, 8:50 PM, "slurm-users on behalf of Gilles Gouaillardet"
<slurm-users-bounces at lists.schedmd.com on behalf of gilles at rist.or.jp>
wrote:

>Christopher,
>
>
>This looks like a SLURM issue and Open MPI is (currently) out of the 
>picture.
>
>
>What if you run
>
>
>srun --pack-group=0,1 hostname
>
>
>Do you get a similar error?
>
>
>Cheers,
>
>Gilles
>
>On 10/10/2018 3:07 AM, Christopher Benjamin Coffey wrote:
>> Hi,
>>
>> I have a user trying to setup a heterogeneous job with one 
>>MPI_COMM_WORLD with the following:
>>
>> ==========
>> #!/bin/bash
>> #SBATCH --job-name=hetero
>> #SBATCH --output=/scratch/cbc/hetero.txt
>> #SBATCH --time=2:00
>> #SBATCH --workdir=/scratch/cbc
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=2g --ntasks=1 -C sb
>> #SBATCH packjob
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=1g --ntasks=1 -C sl
>> #SBATCH --mail-type=START,END
>>
>> module load openmpi/3.1.2-gcc-6.2.0
>>
>> srun --pack-group=0,1 ~/hellompi
>> ===========
>>
>>
>> Yet, we get an error: "srun: fatal: Job steps that span multiple 
>>components of a heterogeneous job are not currently supported". But 
>>the docs seem to indicate it should work?
>>
>> IMPORTANT: The ability to execute a single application across more 
>>than one job allocation does not work with all MPI implementations or 
>>Slurm MPI plugins. Slurm's ability to execute such an application can 
>>be disabled on the entire cluster by adding "disable_hetero_steps" to 
>>Slurm's SchedulerParameters configuration parameter.
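>>
>> (That is, heterogeneous steps would only be disabled cluster-wide by a 
>>slurm.conf line along the lines of SchedulerParameters=disable_hetero_steps.)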
>>
>> By default, the applications launched by a single execution of the 
>>srun command (even for different components of the heterogeneous job) 
>>are combined into one MPI_COMM_WORLD with non-overlapping task IDs.
>>
>> Does this not work with Open MPI? If not, which MPI/Slurm configuration 
>>will work? We have MpiDefault=pmi2 in slurm.conf currently. I've tried a 
>>recent Open MPI, as well as MPICH and MVAPICH2.
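>>
>> (For what it's worth, "srun --mpi=list" should print the MPI plugin types 
>>this Slurm build supports; MpiDefault in slurm.conf just picks the default 
>>from that list.)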
>>
>> Any help would be appreciated, thanks!
>>
>> Best,
>> Chris
>>
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167
>>   
>>
>
>


