[slurm-users] salloc problem
Chris Samuel
chris at csamuel.org
Mon Oct 31 05:44:00 UTC 2022
On 27/10/22 4:18 am, Gizo Nanava wrote:
> we run into another issue when using salloc interactively on a cluster where Slurm
> power saving is enabled. The problem seems to be caused by the job_container plugin
> and occurs when the job starts on a node which boots from a power down state.
> If I resubmit a job immediately after the failure to the same node, it always works.
> I can't find any other way to reproduce the issue other than booting a reserved node from a power down state.
Looking at this:
> slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
I'm wondering whether /scratch is a separate filesystem and, if so, could
it be only getting mounted _after_ slurmd has started on the node?
If that's the case then it would explain the error and why it works
immediately after.
On our systems we always try and ensure that slurmd is the very last
thing to start on a node, and it only starts if everything has succeeded
up to that point.
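As a rough illustration of that kind of gate (a minimal sketch, not what we
actually run), something like the following could be called before slurmd is
started. It assumes /scratch is the filesystem the job_container plugin needs
and that it is a real mount point; the function names, the timeout and the
polling interval are all made up for the example.

```python
import os
import time


def wait_for_mount(path, timeout=60.0, interval=1.0):
    """Return True once `path` is a mount point, False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        # os.path.ismount() returns False for paths that do not exist yet.
        if os.path.ismount(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def missing_mounts(paths, timeout=60.0, interval=1.0):
    """Return the subset of `paths` that never became mount points."""
    return [p for p in paths if not wait_for_mount(p, timeout, interval)]


def main(paths=("/scratch",)):
    """Return 0 if every path is mounted, 1 otherwise (suitable as an exit code)."""
    # Assumption: /scratch is the only filesystem slurmd depends on here.
    missing = missing_mounts(list(paths))
    for p in missing:
        print(f"{p} is not mounted; slurmd should not be started yet")
    return 1 if missing else 0
```

A wrapper would pass the return value of main() to sys.exit() and only launch
slurmd when it is 0. How that is wired into the boot sequence depends on the
site, so treat this as a starting point only.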
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA