[slurm-users] salloc problem
Gizo Nanava
nanava at luis.uni-hannover.de
Wed Nov 30 12:53:39 UTC 2022
Sorry for this very late response.
The directory where job containers are created does of course already exist - it is on the local filesystem.
We also start slurmd as the very last process, once a node is ready to accept jobs.
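For reference, the kind of ordering I mean can be expressed with a systemd drop-in along these lines
(unit and path names here are only illustrative, not our exact configuration):

    # /etc/systemd/system/slurmd.service.d/wait-for-scratch.conf
    [Unit]
    # do not start slurmd until the job_container base path is mounted
    RequiresMountsFor=/scratch
    After=local-fs.target remote-fs.target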
This looks like either a peculiarity of salloc or a bug in Slurm, presumably caused by a race condition -
in very rare cases salloc works without hitting this issue.
I see that the documentation on Slurm power saving mentions salloc, but not the case of using it interactively.
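For context, the power saving setup itself is the standard slurm.conf mechanism, roughly along these
lines (values and script paths are only illustrative, not our exact configuration):

    SuspendProgram=/usr/local/sbin/node_suspend.sh
    ResumeProgram=/usr/local/sbin/node_resume.sh
    SuspendTime=1800
    SuspendTimeout=120
    ResumeTimeout=600

The failure only shows up when salloc's allocation lands on a node that ResumeProgram has just powered on.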
Thank you & best regards
Gizo
> On 27/10/22 4:18 am, Gizo Nanava wrote:
>
> > we run into another issue when using salloc interactively on a cluster where Slurm
> > power saving is enabled. The problem seems to be caused by the job_container plugin
> > and occurs when a job starts on a node that is booting from a powered-down state.
> > If I resubmit the job to the same node immediately after the failure, it always works.
> > I can't find any way to reproduce the issue other than booting a reserved node from a powered-down state.
>
> Looking at this:
>
> > slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
>
> I'm wondering if /scratch is a separate filesystem and, if so, could it be
> getting mounted _after_ slurmd has started on the node?
>
> If that's the case then it would explain the error and why it works
> immediately after.
>
> On our systems we always try and ensure that slurmd is the very last
> thing to start on a node, and it only starts if everything has succeeded
> up to that point.
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA
>
>