[slurm-users] salloc problem

Chris Samuel chris at csamuel.org
Mon Oct 31 05:44:00 UTC 2022


On 27/10/22 4:18 am, Gizo Nanava wrote:

> we run into another issue when using salloc interactively on a cluster where Slurm
> power saving is enabled. The problem seems to be caused by the job_container plugin
> and occurs when the job starts on a node which boots from a power down state.
> If I resubmit a job immediately after the failure to the same node, it always works.
> I can't find any other way to reproduce the issue other than booting a reserved node from a power down state.

Looking at this:

> slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory

I'm wondering whether /scratch is a separate filesystem and, if so, 
could it only be getting mounted _after_ slurmd has started on the node?

If that's the case then it would explain both the error and why a 
resubmission works immediately afterwards: by then /scratch has been 
mounted.

On our systems we always try and ensure that slurmd is the very last 
thing to start on a node, and it only starts if everything has succeeded 
up to that point.
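
As a rough illustration of that ordering (not what we actually run), a 
small pre-start check could hold slurmd back until /scratch is really 
mounted. This is only a sketch: it assumes /scratch is its own mount 
point, and the function name, timeout and polling interval are made up 
for the example.

```python
import os
import time


def wait_for_mount(path, timeout=300.0, interval=5.0,
                   is_mount=os.path.ismount, sleep=time.sleep,
                   clock=time.monotonic):
    """Return True once `path` is a mount point, or False if `timeout`
    seconds pass first.

    `is_mount`, `sleep` and `clock` are injectable only so the function
    can be exercised without a real mount; the defaults use the standard
    library. The timeout and interval values are arbitrary assumptions.
    """
    deadline = clock() + timeout
    while True:
        if is_mount(path):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A wrapper script could call wait_for_mount("/scratch") and exit non-zero 
when it returns False, and that script could then be run before slurmd 
(for example from an ExecStartPre= line in a systemd drop-in), so that 
slurmd does not start on a node whose /scratch never appeared.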

All the best,
Chris
-- 
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



