[slurm-users] Reproducible irreproducible problem (timeout?)

Laurence Marks laurence.marks at gmail.com
Wed Dec 20 20:45:22 UTC 2023


Dependencies are not an appropriate approach.

---
Professor Laurence Marks (Laurie)
www.numis.northwestern.edu
"Research is to see what everybody else has seen, and to think what nobody
else has thought" Albert Szent-Györgyi

On Wed, Dec 20, 2023, 14:40 Renfro, Michael <Renfro at tntech.edu> wrote:

> Is this Northwestern’s Quest HPC or another one? I know at least a few of
> the people involved with Quest, and I wouldn’t have thought they’d be in
> dire need of coaching.
>
>
>
> And to follow on with Davide’s point, this really sounds like a case for
> submitting multiple jobs with dependencies between them, as per [1, 2, 3].
>
>
>
> [1]
> https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795
>
> [2]
> https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0
>
> [3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
>
>
>
> From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Laurence Marks <laurence.marks at gmail.com>
> Date: Wednesday, December 20, 2023 at 1:40 PM
> To: Slurm User Community List <slurm-users at lists.schedmd.com>
> Subject: Re: [slurm-users] Reproducible irreproducible problem (timeout?)
>
> It is a University "supercomputer", not a national facility. Hence they
> are not that expert, which is why I am asking here. I am pretty certain
> that it is some form of communication issue, but beyond that it is not
> clear.
>
>
>
> If I get suggestions such as "why don't they look for ABC in XYZ" then I
> may persuade them to look at specifics. They will need the coaching, alas.
>
>
>
> On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g.s at arcor.de> wrote:
>
> Laurence Marks wrote:
>
> > After some (irreproducible) time, often one of the three slow tasks hangs.
> > A symptom is that if I try and ssh into the main node of the subtask (which
> > is running 128 mpi on the 4 nodes) I get "Authentication failed".
>
> How about asking an admin to check why it hangs?
>
>
>
>
> --
>
> Emeritus Professor Laurence Marks (Laurie)
>
> Northwestern University
>
> Webpage <http://www.numis.northwestern.edu/> and Google Scholar link
> <http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en>
>
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought", Albert Szent-Györgyi
>