[slurm-users] salloc problem
Gizo Nanava
nanava at luis.uni-hannover.de
Wed Nov 30 12:53:39 UTC 2022
Sorry for this very late response.
The directory where job containers are created does of course already exist - it is on the local filesystem.
We also start slurmd as the very last process, once a node is ready to accept jobs.
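For reference, the kind of ordering I mean can be expressed with a systemd drop-in along these lines
(unit and path names here are only illustrative, not our exact configuration):

    # /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
    [Unit]
    # do not start slurmd until the job_container base path is mounted
    RequiresMountsFor=/scratch
    After=local-fs.target remote-fs.target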
This looks like either a peculiarity of salloc or a bug in Slurm, presumably caused by a race condition -
in very rare cases salloc works without hitting this issue.
I see that the documentation on Slurm power saving mentions salloc, but not the case of using it interactively.
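For context, the power saving setup itself is the standard slurm.conf mechanism, roughly along these
lines (values and script paths are only illustrative, not our exact configuration):

    SuspendProgram=/usr/local/sbin/node_suspend.sh
    ResumeProgram=/usr/local/sbin/node_resume.sh
    SuspendTime=1800
    SuspendTimeout=120
    ResumeTimeout=600

The failure only shows up when salloc's allocation lands on a node that ResumeProgram has just powered on.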
Thank you & best regards
Gizo
> On 27/10/22 4:18 am, Gizo Nanava wrote:
>
> > we run into another issue when using salloc interactively on a cluster where Slurm
> > power saving is enabled. The problem seems to be caused by the job_container plugin
> > and occurs when a job starts on a node that is booting from a powered-down state.
> > If I resubmit the job to the same node immediately after the failure, it always works.
> > I can't find any way to reproduce the issue other than booting a reserved node from a powered-down state.
>
> Looking at this:
>
> > slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
>
> I'm wondering if /scratch is a separate filesystem and, if so, could it be
> getting mounted _after_ slurmd has started on the node?
>
> If that's the case then it would explain the error and why it works
> immediately after.
>
> On our systems we always try and ensure that slurmd is the very last
> thing to start on a node, and it only starts if everything has succeeded
> up to that point.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>
>