[slurm-users] Reproducible irreproducible problem (timeout?)

Renfro, Michael Renfro at tntech.edu
Wed Dec 20 20:40:40 UTC 2023


Is this Northwestern’s Quest HPC or another one? I know at least a few of the people involved with Quest, and I wouldn’t have thought they’d be in dire need of coaching.

And to follow on with Davide’s point, this really sounds like a case for submitting multiple jobs with dependencies between them, as per [1, 2, 3].

[1] https://services.northwestern.edu/TDClient/30/Portal/KB/ArticleDet?ID=1795
[2] https://bioinformaticsworkbook.org/Appendix/HPC/SLURM/submitting-dependency-jobs-using-slurm.html#gsc.tab=0
[3] https://slurm.schedmd.com/sbatch.html#OPT_dependency
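
To make the dependency idea concrete, here is a minimal sketch of chaining jobs so that each one starts only after the previous one finishes successfully. It is an illustration, not a tested recipe for any particular cluster. The function names are made up for this example and the script names passed in would be your own. It relies only on the `--parsable` and `--dependency=afterok:<jobid>` options described in [3].

```python
import subprocess


def build_sbatch_cmd(script, depends_on=None, dep_type="afterok"):
    """Build the sbatch argument list for one stage of the chain.

    depends_on is a list of job IDs this stage must wait for; dep_type is
    one of the dependency types from the sbatch man page (afterok, afterany, ...).
    """
    cmd = ["sbatch", "--parsable"]
    if depends_on:
        ids = ":".join(str(job_id) for job_id in depends_on)
        cmd.append(f"--dependency={dep_type}:{ids}")
    cmd.append(script)
    return cmd


def parse_job_id(parsable_output):
    """Extract the job ID from `sbatch --parsable` output.

    --parsable prints either "jobid" or "jobid;cluster".
    """
    return parsable_output.strip().split(";")[0]


def submit_chain(scripts, run=subprocess.run):
    """Submit scripts in order, each depending on the one before it.

    `run` is injectable so the logic can be exercised without a Slurm
    installation; by default it calls the real sbatch.
    """
    job_ids = []
    previous = None
    for script in scripts:
        cmd = build_sbatch_cmd(script, depends_on=[previous] if previous else None)
        result = run(cmd, check=True, capture_output=True, text=True)
        previous = parse_job_id(result.stdout)
        job_ids.append(previous)
    return job_ids
```

Calling `submit_chain(["stage1.sh", "stage2.sh", "stage3.sh"])` with your own batch scripts would submit all three at once, with the scheduler holding the later ones until their predecessor exits cleanly. That way each slow task gets its own allocation and time limit instead of all of them sharing one long-running job.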

From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Laurence Marks <laurence.marks at gmail.com>
Date: Wednesday, December 20, 2023 at 1:40 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Reproducible irreproducible problem (timeout?)

It is a University "supercomputer", not a national facility, so the admins are not that expert, which is why I am asking here. I am pretty certain that it is some form of communication issue, but beyond that it is not clear.

If I get suggestions such as "why don't they look for ABC in XYZ" then I may persuade them to look at specifics. They will need the coaching, alas.

On Wed, Dec 20, 2023 at 1:25 PM Gerhard Strangar <g.s at arcor.de> wrote:
Laurence Marks wrote:

> After some (irreproducible) time, often one of the three slow tasks hangs.
> A symptom is that if I try and ssh into the main node of the subtask (which
> is running 128 mpi on the 4 nodes) I get "Authentication failed".

How about asking an admin to check why it hangs?


--
Emeritus Professor Laurence Marks (Laurie)
Northwestern University
Webpage: http://www.numis.northwestern.edu/
Google Scholar: http://scholar.google.com/citations?user=zmHhI9gAAAAJ&hl=en
"Research is to see what everybody else has seen, and to think what nobody else has thought", Albert Szent-Györgyi