[slurm-users] Submitting jobs across multiple nodes fails

Andrej Prsa aprsa09 at gmail.com
Fri Feb 5 02:57:20 UTC 2021


Hi Brian,

> try:
>
> export SLURM_OVERLAP=1
> export SLURM_WHOLE=1
>
> before your salloc and see if that helps. I have seen some mpi issues 
> that were resolved with that.

Unfortunately no dice:

andrej@terra:~$ export SLURM_OVERLAP=1
andrej@terra:~$ export SLURM_WHOLE=1
andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 864
andrej@terra:~$ srun hostname
srun: launch/slurm: launch_p_step_launch: StepId=864.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: task 1 launch failed: Unspecified error
srun: error: task 0 launch failed: Unspecified error

> You can also try it using just the regular mpirun on the nodes 
> allocated. That will help with a datapoint as well.

Same as above, unfortunately.

_But:_ I can get it to work correctly if I replace MpiDefault=pmix with 
MpiDefault=none in slurm.conf. It looks like there's something amiss with 
PMIx support in Slurm?

andrej@terra:~$ salloc -N2 -n2
salloc: Granted job allocation 866
andrej@terra:~$ srun hostname
node11
node10
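
For anyone hitting the same thing, here is a minimal sketch for checking 
which MPI plugin is in effect and overriding it for a single step, without 
editing slurm.conf. The helper name mpi_default is made up for illustration 
and is not a Slurm command. The commands in the trailing comments are 
standard Slurm ones, but I'm assuming they are on PATH and that the config 
output uses the usual "Key = value" layout.

```shell
# Print the MpiDefault value found in slurm.conf-style or
# `scontrol show config`-style text read from stdin.
# mpi_default is a hypothetical helper name, not part of Slurm.
mpi_default() {
    awk -F'=' '
        /^[[:space:]]*MpiDefault[[:space:]]*=/ {
            value = $2
            sub(/#.*/, "", value)            # drop a trailing comment, if any
            gsub(/[[:space:]]/, "", value)   # strip padding around the value
            print value
        }'
}

# On a cluster (assumes scontrol and srun are on PATH):
#   scontrol show config | mpi_default   # default MPI plugin currently in effect
#   srun --mpi=list                      # MPI plugin types this srun can load
#   srun --mpi=none hostname             # override the default for one step only
```

If srun --mpi=none hostname works while the plain srun hostname fails, that 
would point at the pmix plugin rather than at the allocation itself.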

Cheers,
Andrej
