[slurm-users] Reproducible irreproducible problem (timeout?)
Laurence Marks
laurence.marks at gmail.com
Wed Dec 20 19:39:04 UTC 2023
It is a university "supercomputer", not a national facility, so the admins
are not that expert, which is why I am asking here. I am fairly certain that
it is some form of communication issue, but beyond that it is not clear.
If I get suggestions such as "why don't they look for ABC in XYZ", then I
may be able to persuade them to look at specifics. They will need the
coaching, alas.
On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g.s at arcor.de> wrote:
> Laurence Marks wrote:
>
> > After some (irreproducible) time, often one of the three slow tasks hangs.
> > A symptom is that if I try and ssh into the main node of the subtask (which
> > is running 128 mpi on the 4 nodes) I get "Authentication failed".
>
> How about asking an admin to check why it hangs?
>
>
--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage <http://www.numis.northwestern.edu> and Google Scholar link
<http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
"Research is to see what everybody else has seen, and to think what nobody
else has thought", Albert Szent-Györgyi