<div dir="ltr"><div dir="ltr"><div>Not an answer to your question, but if the jobs need to be subdivided, why not submit smaller jobs?</div><div><br></div><div>Also, this does not sound like a Slurm problem, but rather a code or infrastructure issue.</div><div><br></div><div>Finally, are you typically able to ssh into the main node of each subtask? In many places that is not allowed, and you would get the "Authentication failed" error regardless. Some places (but definitely not all) instead allow logging in with something like</div><div><br></div><div><span style="color:rgb(23,43,77);font-family:SFMono-Medium,&quot;SF Mono&quot;,&quot;Segoe UI Mono&quot;,&quot;Roboto Mono&quot;,&quot;Ubuntu Mono&quot;,Menlo,Courier,monospace;font-size:14px">srun --jobid &lt;nnnn&gt; --pty bash</span><br></div></div><br><div class="gmail_quote"><div class="gmail_attr">where &lt;nnnn&gt; is your job ID. Hope this helps.</div><div class="gmail_attr"><br></div><div class="gmail_attr"><br></div><div dir="ltr" class="gmail_attr">On Wed, Dec 20, 2023 at 6:34 AM Laurence Marks &lt;<a href="mailto:laurence.marks@gmail.com">laurence.marks@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)">I know that sounds improbable, but please read on.</div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)"><br></div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)">I am running a reasonably large job on a university supercomputer (not a national facility) using 12 nodes with 64 cores each. The job loops through a sequence of commands, some of which are single-CPU, but with a slow step where 3 tasks, each with 4 nodes running hybrid OpenMP/MPI, are launched. I use mpirun for this (Intel MPI), which in turn uses srun for each. These slow steps run for about 50 minutes. 
The full job runs for 48 hours, and I typically queue 11 of these at a time to run in parallel on different nodes.</div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)"><br></div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)">After some (irreproducible) time, one of the three slow tasks often hangs. A symptom is that if I try to ssh into the main node of the subtask (which is running 128 MPI processes on the 4 nodes) I get "Authentication failed". Sometimes I can kill the mpiexec on the main parent node; this propagates and I can continue (with some fault handling).</div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)"><br></div><div><div style="font-family:verdana,sans-serif;color:rgb(0,0,0)">I know most people expect a single srun to be used, rather than a complex loop as above. The reason is that it is much, much more efficient to subdivide the problem, and code maintenance is also better with subproblems. This is an established code (it has been around for 20+ years). I wonder if there are some timeouts or something similar that drop connectivity. I also wonder whether repeatedly launching srun subtasks might be doing something beyond what is normally expected.</div><br></div><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr">Emeritus Professor Laurence Marks (Laurie)<div>Northwestern University<br><div><a href="http://www.numis.northwestern.edu" target="_blank">Webpage</a> and <a href="http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en" target="_blank">Google Scholar link</a></div><div>"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi</div></div></div></div></div>

</blockquote></div></div>