[slurm-users] Reproducible irreproducible problem (timeout?)

Laurence Marks laurence.marks at gmail.com
Wed Dec 20 13:31:51 UTC 2023


I know that sounds improbable, but please read on.

I am running a reasonably large job on a university supercomputer (not a
national facility) using 12 nodes, each with 64 cores. The job loops through
a sequence of commands, some of which are single-CPU, but with a slow step
where 3 tasks, each using 4 nodes and running hybrid OpenMP/MPI, are
launched. I use mpirun for this (Intel MPI), which in turn uses srun for
each. These slow steps run for about 50 minutes. The full job runs for 48
hours, and I typically queue 11 of these at a time to run in parallel on
different nodes.

After some (irreproducible) time, one of the three slow tasks often hangs.
One symptom is that if I try to ssh into the main node of the hung subtask
(which is running 128 MPI processes on its 4 nodes) I get "Authentication
failed". Sometimes I can kill the mpiexec on the main parent node; this
propagates, and I can continue (with some fault handling).
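
For readers who want something concrete, here is a minimal sketch of the kind
of fault handling mentioned above. It is an illustration under assumptions,
not the code actually used in this job: the function name run_with_watchdog,
the "timeout" marker, and the 10-second grace period are all hypothetical. It
launches several commands in parallel, waits up to a deadline, and terminates
any that are still running. Whether terminating mpirun/mpiexec cleans up the
remote ranks depends on the MPI launcher and the site configuration.

```python
import subprocess
import sys
import time


def run_with_watchdog(commands, timeout_s, poll_s=1.0):
    """Run all commands in parallel and stop any that exceed timeout_s.

    commands: list of argument lists (in the setting above, each would be
    the mpirun command line of one subtask; that mapping is an assumption).
    Returns one entry per command: its exit code, or the string "timeout"
    if the watchdog had to stop it.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    deadline = time.monotonic() + timeout_s

    # Poll until every process has exited or the deadline has passed.
    while time.monotonic() < deadline and any(p.poll() is None for p in procs):
        time.sleep(poll_s)

    results = []
    for p in procs:
        if p.poll() is None:
            # Still running: ask politely first, then force after a grace period.
            p.terminate()
            try:
                p.wait(timeout=10)
            except subprocess.TimeoutExpired:
                p.kill()
                p.wait()
            results.append("timeout")
        else:
            results.append(p.returncode)
    return results
```

For a slow step that normally takes about 50 minutes, timeout_s would be set
somewhat above that, and any "timeout" entry would trigger whatever recovery
the outer loop provides (for example, rerunning that subtask).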

I know most people expect a single srun to be used, rather than a complex
loop as above. The reason for the loop is that it is much, much more
efficient to subdivide the problem, and code maintenance is also better
with subproblems. This is an established code (it has been around for 20+
years). I wonder if there are some timeouts or something similar that drop
connectivity. I also wonder whether repeatedly launching srun subtasks
might be doing something beyond what is normally expected.

-- 
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage <http://www.numis.northwestern.edu> and Google Scholar link
<http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi