[slurm-users] salloc problem

Gizo Nanava nanava at luis.uni-hannover.de
Thu Oct 27 11:18:49 UTC 2022


Hello, 

we have run into another issue when using salloc interactively on a cluster where Slurm
power saving is enabled. The problem seems to be caused by the job_container plugin
and occurs when the job starts on a node that is booting from a powered-down state.
If I resubmit a job immediately after the failure to the same node, it always works.
The only way I have found to reproduce the issue is to boot a reserved node from a powered-down state.
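For reference, this is roughly how I trigger it (assuming admin rights; the "~" suffix in sinfo marks a powered-down node):

> scontrol update NodeName=isu-n001 State=POWER_DOWN
> sinfo -n isu-n001          # wait until the node shows idle~
> salloc --nodelist=isu-n001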

Is this a known issue?

srun and sbatch are not affected.
We use Slurm 22.05.3.

>  salloc --nodelist=isu-n001
salloc: Granted job allocation 791670                               
salloc: Waiting for resource configuration                                                       
salloc: Nodes isu-n001 are ready for job                                                                                                                                                                          
slurmstepd: error: container_p_join: open failed for /scratch/job_containers/791670/.ns: No such file or directory
slurmstepd: error: container_g_join failed: 791670                                                                                                                          
slurmstepd: error: write to unblock task 0 failed: Broken pipe      
srun: error: isu-n001: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=791670.interactive                            
salloc: Relinquishing job allocation 791670             
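When it fails, this is the check I run on the node to see whether the prolog ever created the namespace file named in the error (the path follows our BasePath below):

> ssh isu-n001 'ls -la /scratch/job_containers/791670/.ns'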

# Slurm controller configs
#
> cat /etc/slurm/slurm.conf
..
JobContainerType=job_container/tmpfs
..
LaunchParameters=use_interactive_step
InteractiveStepOptions="--interactive --preserve-env --pty $SHELL -l"
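With use_interactive_step, salloc itself launches the step named in the error output above (StepId=791670.interactive); given our InteractiveStepOptions it effectively runs the following inside the allocation, and that is the step whose container_g_join fails:

> srun --interactive --preserve-env --pty $SHELL -l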
  
# Job_container
#    
> cat /etc/slurm/job_container.conf
AutoBasePath=true
BasePath=/scratch/job_containers
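With AutoBasePath=true, slurmd should create the base directory itself at startup, so after a cold boot I also verify on the node that the directory exists and that /scratch is mounted at all (just a sanity check):

> ssh isu-n001 'ls -ld /scratch/job_containers; findmnt /scratch'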

Thank you & kind regards
Gizo


