[slurm-users] Reproducible irreproducible problem (timeout?)
Renfro, Michael
Renfro at tntech.edu
Wed Dec 20 20:40:40 UTC 2023
Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching.
And to follow on with Davide’s point, this really sounds like a case for submitting multiple jobs with dependencies between them, as per [1, 2, 3].
[1] https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795
[2] https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0
[3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
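As a rough illustration of the dependency approach described in [3], here is a minimal Python sketch that submits a chain of batch scripts, each one starting only after the previous job finishes successfully. The script names and the helper functions (build_sbatch_command, run_sbatch, submit_chain) are made up for this example and are not from the original thread. The only Slurm features assumed are sbatch's --parsable and --dependency=afterok:<jobid> options. Adapt it to the actual workflow rather than treating it as a drop-in solution.

```python
import subprocess
from typing import Callable, List, Optional


def build_sbatch_command(script: str, depends_on: Optional[str] = None) -> List[str]:
    """Return the sbatch argument list for one stage of the chain."""
    # --parsable makes sbatch print just the job ID (optionally "jobid;cluster").
    cmd = ["sbatch", "--parsable"]
    if depends_on is not None:
        # afterok: start this job only if the earlier job exited with code 0.
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script)
    return cmd


def run_sbatch(cmd: List[str]) -> str:
    """Run sbatch and return the job ID it reports."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # Drop the optional ";cluster" suffix that --parsable may append.
    return result.stdout.strip().split(";")[0]


def submit_chain(
    scripts: List[str],
    submit: Callable[[List[str]], str] = run_sbatch,
) -> List[str]:
    """Submit scripts in order, each depending on the one before it."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(build_sbatch_command(script, previous))
        job_ids.append(previous)
    return job_ids


# Hypothetical usage on a login node (script names are placeholders):
#   submit_chain(["stage1.sh", "stage2.sh", "stage3.sh"])
```

Splitting the work this way means a hang in one stage only ties up that stage's allocation, and the later stages stay pending until the earlier job completes successfully.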
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Laurence Marks <laurence.marks at gmail.com>
Date: Wednesday, December 20, 2023 at 1:40 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Reproducible irreproducible problem (timeout?)
It is a University "supercomputer", not a national facility. Hence they are not that expert, which is why I am asking here. I am pretty certain that it is some form of communication issue, but beyond that it is not clear.
If I get suggestions such as "why don't they look for ABC in XYZ" then I may persuade them to look at specifics. They will need the coaching, alas.
On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g.s at arcor.de> wrote:
Laurence Marks wrote:
> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try and ssh into the main node of the subtask (which
> is running 128 mpi on the 4 nodes) I get "Authentication failed".
How about asking an admin to check why it hangs?
--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage: http://www.numis.northwestern.edu/ and Google Scholar link: http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi