[slurm-users] running mpi from inside an mpi job

Vanhorn, Mike michael.vanhorn at wright.edu
Tue Jun 20 14:08:47 UTC 2023

I have a user who is submitting a job to slurm which requests 16 tasks, i.e.

#SBATCH --ntasks 16
#SBATCH --cpus-per-task 1

The Slurm script runs an MPI program called Parent.mpi, which then tries (and fails) to launch 15 MPI child processes. He’s tried two different ways for the parent to spawn the children:

  1.  A system() call, such as system("srun --ntasks=4 mpirun -np 4 ./child.mpi") or system("mpirun -np 4 ./child.mpi")

  2.  MPI_Comm_spawn

Both ways generate the following in the slurm output file:

srun: Job ### step creation temporarily disabled, retrying (Requested nodes are busy)
srun: error: Unable to create step for job ###: Job/step already completing or completed

So, basically, he’s requesting 16 tasks, one of which is used by the parent while the other 15 are supposed to be used by the children, but the children can’t use those 15 because... well, I’m not sure why.
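
To spell out the intended task budget, here is a minimal sketch. The variable names are illustrative only; SLURM_NTASKS is the variable Slurm sets inside a job submitted with --ntasks, and the fallback of 16 is an assumption so the snippet also runs outside a job. It only does the arithmetic and does not launch anything:

```shell
#!/bin/bash
# SLURM_NTASKS is set by Slurm inside a job submitted with --ntasks.
# The default of 16 is an assumption so this also runs outside a job.
total_tasks="${SLURM_NTASKS:-16}"

# One task is taken by the parent (Parent.mpi); the remainder is what
# the child processes are expected to use.
parent_tasks=1
child_tasks=$(( total_tasks - parent_tasks ))

echo "total=${total_tasks} parent=${parent_tasks} children=${child_tasks}"
```

With the 16 tasks requested above, that leaves 15 for the children.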

Is there something I need to change in the slurm.conf to allow this to work?

Mike VanHorn
Senior Computer Systems Administrator
College of Engineering and Computer Science
Wright State University
265 Russ Engineering Center
michael.vanhorn at wright.edu