[slurm-users] Heterogeneous job one MPI_COMM_WORLD

Pritchard Jr., Howard howardp at lanl.gov
Wed Oct 10 07:58:21 MDT 2018


Hi Christopher,

We hit some problems at LANL trying to use this SLURM feature.
At the time, I think SchedMD said fixes to the SLURM PMI2 library
would be needed to get this to work.
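
If both your Slurm build and your Open MPI build include compatible
PMIx support, it may be worth trying the pmix plugin instead of pmi2.
A rough sketch (untested here, and assuming the pmix plugin is actually
installed on your cluster) would be:

    srun --mpi=pmix --pack-group=0,1 ~/hellompi   # or set MpiDefault=pmix in slurm.conf

Whether that lifts the "span multiple components" restriction will
still depend on the Slurm version.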

What version of SLURM are you using?
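
For reference, a quick way to check that, and to see which MPI plugins
your installation actually provides (assuming the standard Slurm client
tools are on your PATH):

    srun --version                              # Slurm version of the client tools
    srun --mpi=list                             # MPI plugin types this build supports
    scontrol show config | grep -i MpiDefault   # cluster-wide default MPI plugin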

Howard


-- 
Howard Pritchard

B Schedule
HPC-ENV
Office 9, 2nd floor Research Park
TA-03, Building 4200, Room 203

Los Alamos National Laboratory





On 10/9/18, 8:50 PM, "slurm-users on behalf of Gilles Gouaillardet"
<slurm-users-bounces at lists.schedmd.com on behalf of gilles at rist.or.jp>
wrote:

>Christopher,
>
>
>This looks like a SLURM issue and Open MPI is (currently) out of the
>picture.
>
>
>What happens if you run
>
>
>srun --pack-group=0,1 hostname
>
>
>Do you get a similar error?
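>
>
>A fuller interactive check, assuming Slurm 17.11 or later and reusing the
>constraints from your batch script, could be:
>
>
>salloc -n1 -C sb : -n1 -C sl       # ':' separates the two pack groups
>srun --pack-group=0,1 hostname     # run one step across both groups
>
>
>If plain hostname already fails with the same fatal error, the limitation
>is on the Slurm side rather than in the MPI layer.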
>
>
>Cheers,
>
>Gilles
>
>On 10/10/2018 3:07 AM, Christopher Benjamin Coffey wrote:
>> Hi,
>>
>> I have a user trying to setup a heterogeneous job with one
>>MPI_COMM_WORLD with the following:
>>
>> ==========
>> #!/bin/bash
>> #SBATCH --job-name=hetero
>> #SBATCH --output=/scratch/cbc/hetero.txt
>> #SBATCH --time=2:00
>> #SBATCH --workdir=/scratch/cbc
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=2g --ntasks=1 -C sb
>> #SBATCH packjob
>> #SBATCH --cpus-per-task=1 --mem-per-cpu=1g  --ntasks=1 -C sl
>> #SBATCH --mail-type=START,END
>>
>> module load openmpi/3.1.2-gcc-6.2.0
>>
>> srun --pack-group=0,1 ~/hellompi
>> ===========
>>
>>
>> Yet, we get an error: " srun: fatal: Job steps that span multiple
>>components of a heterogeneous job are not currently supported". But the
>>docs seem to indicate it should work?
>>
>> IMPORTANT: The ability to execute a single application across more than
>>one job allocation does not work with all MPI implementations or Slurm
>>MPI plugins. Slurm's ability to execute such an application can be
>>disabled on the entire cluster by adding "disable_hetero_steps" to
>>Slurm's SchedulerParameters configuration parameter.
>>
>> By default, the applications launched by a single execution of the srun
>>command (even for different components of the heterogeneous job) are
>>combined into one MPI_COMM_WORLD with non-overlapping task IDs.
>>
>> Does this not work with Open MPI? If not, which MPI/Slurm
>>configuration will work? We currently have MpiDefault=pmi2 in
>>slurm.conf. I've tried a recent Open MPI, as well as MPICH and MVAPICH2.
>>
>> Any help would be appreciated, thanks!
>>
>> Best,
>> Chris
>>
>> Christopher Coffey
>> High-Performance Computing
>> Northern Arizona University
>> 928-523-1167
>>   
>>
>
>


