[slurm-users] Reproducible irreproducible problem (timeout?)

Davide DelVento davide.quantum at gmail.com
Wed Dec 20 17:33:59 UTC 2023


Not an answer to your question, but if the jobs need to be subdivided, why
not submit smaller jobs?

Also, this does not sound like a Slurm problem, but rather a code or
infrastructure issue.
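
If the hang really is on the code side, one possible workaround is to put a
watchdog around each slow step so that a stuck launch is killed and your
existing fault handling can take over. Below is a minimal sketch, assuming GNU
coreutils `timeout` is available on the node running the loop. The function
name `run_with_watchdog` and the time limits are made up for illustration, and
I am assuming that killing the launcher propagates to its ranks, which you say
happens only sometimes.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a wall-clock limit.
# Usage: run_with_watchdog <limit> <command> [args...]
# <limit> is any duration accepted by GNU timeout (e.g. 4500 or 75m).
run_with_watchdog() {
    wd_limit="$1"
    shift
    wd_status=0
    # Send SIGTERM at the limit, then SIGKILL 30 seconds later if the
    # command is still alive.
    timeout --signal=TERM --kill-after=30 "$wd_limit" "$@" || wd_status=$?
    # GNU timeout exits with 124 when the limit was reached; if SIGKILL
    # had to be sent, the status is 137 instead.
    if [ "$wd_status" -eq 124 ]; then
        echo "step timed out after ${wd_limit}: $*" >&2
    fi
    return "$wd_status"
}
```

In your loop, each slow step would then be launched as something like
`run_with_watchdog 75m mpirun ...` (with your own arguments), and a return
status of 124 would tell the fault handling that the step was killed for
running too long rather than failing on its own.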

Finally, are you normally able to ssh into the main node of each subtask?
Many sites do not allow that, and you would get the "Authentication
failed" error regardless. Some sites (but definitely not all) instead
allow you to log in with something like

srun --jobid <nnnn> --pty bash

where <nnnn> is your job ID. Hope this helps.


On Wed, Dec 20, 2023 at 6:34 AM Laurence Marks <laurence.marks at gmail.com>
wrote:

> I know that sounds improbable, but please read on.
>
> I am running a reasonably large job on a University supercomputer (not a
> national facility) using 12 nodes with 64 cores each. The job loops through
> a sequence of commands, some of which are single CPU, but with a slow step
> where 3 tasks, each on 4 nodes and running hybrid OpenMP/MPI, are launched.
> I use mpirun for this (Intel MPI), which in turn uses srun for each. These
> slow steps run for about 50 minutes. The full job runs for 48 hours, and I
> am typically queueing 11 of these at a time to run in parallel on different
> nodes.
>
> After some (irreproducible) time, one of the three slow tasks often hangs.
> A symptom is that if I try to ssh into the main node of the subtask (which
> is running 128 MPI processes on the 4 nodes) I get "Authentication failed".
> Sometimes I can kill the mpiexec on the main parent node and this will
> propagate and I can continue (with some fault handling).
>
> I know most people expect a single srun to be used, rather than a complex
> loop as above. The reason is that it is much, much more efficient to
> subdivide the problem, and also code maintenance is better with
> subproblems. This is an established code (been around 20+ years). I wonder
> if there are some timeouts or something similar which drop connectivity. I
> also wonder whether repeated launching of srun subtasks might be doing
> something beyond what is normally expected.
>
> --
> Emeritus Professor Laurence Marks (Laurie)
> Northwestern University
> Webpage <http://www.numis.northwestern.edu> and Google Scholar link
> <http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Györgyi
>