[slurm-users] salloc problem
Gizo Nanava
nanava at luis.uni-hannover.de
Thu Oct 27 11:18:49 UTC 2022
Hello,
we have run into another issue when using salloc interactively on a cluster where Slurm
power saving is enabled. The problem seems to be caused by the job_container plugin
and occurs when the job starts on a node that is booting from a powered-down state.
If I resubmit a job to the same node immediately after the failure, it always works.
The only way I can reproduce the issue is by booting a reserved node from a powered-down state.
Is this a known issue?
srun and sbatch are not affected.
We use Slurm 22.05.3.
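For reference, this is roughly how I trigger it (a sketch; it assumes admin rights to force the node down, with isu-n001 being one of our power-saving nodes):

> scontrol update NodeName=isu-n001 State=POWER_DOWN
> sinfo -n isu-n001 -o "%N %T"    # wait until the node shows idle~ (powered down)

The first salloc after the node boots again then fails: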
> salloc --nodelist=isu-n001
salloc: Granted job allocation 791670
salloc: Waiting for resource configuration
salloc: Nodes isu-n001 are ready for job
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670
slurmstepd: error: write to unblock task 0 failed: Broken pipe
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive
salloc: Relinquishing job allocation 791670
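The path in the first slurmstepd error is the per-job directory under the job_container BasePath (configs below), so one thing worth checking right after a failure is whether that directory was created on the node at all, e.g.:

> ssh isu-n001 ls -la /scratch/job_containers/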
# Slurm controller configs
#
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"
# Job_container
#
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers
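If I understand the tmpfs plugin correctly, slurmd creates BasePath itself when AutoBasePath=true and bind-mounts a per-job /tmp under it; in a job that starts cleanly, this can be checked from inside the allocation (sketch):

> findmnt /tmp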
Thank you & kind regards
Gizo