<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
Hi Brian,<br>
<br>
<blockquote type="cite"
cite="mid:611ef3c4-b025-cd86-09a3-542e48a03aeb@gmail.com">try: <br>
<br>
export SLURM_OVERLAP=1 <br>
export SLURM_WHOLE=1 <br>
<br>
before your salloc and see if that helps. I have seen some MPI
issues that were resolved with that. <br>
</blockquote>
<br>
Unfortunately, no dice:<br>
<br>
<font face="monospace">andrej@terra:~$ export SLURM_OVERLAP=1<br>
andrej@terra:~$ export SLURM_WHOLE=1<br>
andrej@terra:~$ salloc -N2 -n2 <br>
salloc: Granted job allocation 864<br>
andrej@terra:~$ srun hostname<br>
srun: launch/slurm: launch_p_step_launch: StepId=864.0 aborted
before step completely launched.<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
srun: error: task 1 launch failed: Unspecified error<br>
srun: error: task 0 launch failed: Unspecified error<br>
</font><br>
<blockquote type="cite"
cite="mid:611ef3c4-b025-cd86-09a3-542e48a03aeb@gmail.com">You can
also try it using just the regular mpirun on the allocated nodes.
That will give us another data point as well. <br>
</blockquote>
<br>
Same as above, unfortunately.<br>
<br>
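(For reference, the mpirun test was run inside the same allocation.
I'm assuming Open MPI here, whose mpirun picks the host list up from
the SLURM environment, so the invocation was essentially just:)<br>
<br>
<font face="monospace">mpirun -np 2 hostname<br>
</font><br>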
<u>But:</u> I can get it to work correctly if I replace
MpiDefault=pmix with MpiDefault=none in slurm.conf. It looks like
something is amiss with PMIx support in Slurm?<br>
<br>
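For reference, the slurm.conf change was just this (here
/etc/slurm/slurm.conf, though the path varies by install; the
daemons need to pick it up afterwards, e.g. via scontrol reconfigure
or a restart):<br>
<br>
<font face="monospace"># /etc/slurm/slurm.conf<br>
#MpiDefault=pmix<br>
MpiDefault=none<br>
</font><br>
With that in place:<br>
<br>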
<font face="monospace">andrej@terra:~$ salloc -N2 -n2<br>
salloc: Granted job allocation 866<br>
andrej@terra:~$ srun hostname<br>
node11<br>
node10<br>
</font><br>
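(If it helps narrow things down: srun --mpi=list prints the MPI
plugin types the build knows about, so it should at least confirm
whether the pmix plugin is present at all.)<br>
<br>
<font face="monospace">srun --mpi=list<br>
</font><br>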
Cheers,<br>
Andrej<br>
<br>
</body>
</html>