[slurm-users] Slurm Power Saving & salloc

Gizo Nanava nanava at luis.uni-hannover.de
Tue Oct 25 12:07:48 UTC 2022


Please ignore the question - the option SchedulerParameters=salloc_wait_nodes solves the issue.
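
For anyone who finds this thread later, here is a minimal slurm.conf sketch of the change. It assumes no other SchedulerParameters are set; the option takes a comma-separated list, so on a cluster that already sets some, salloc_wait_nodes would be appended to the existing line rather than added as a second one. With this option salloc waits until all allocated nodes are ready for use (i.e. booted) before it returns, instead of returning as soon as the allocation is made.

```
# slurm.conf (fragment, not a complete configuration)
# Make salloc wait until the allocated nodes have finished booting.
SchedulerParameters=salloc_wait_nodes

# Already in our configuration: salloc starts an interactive step via srun.
LaunchParameters=use_interactive_step
```

After editing slurm.conf, the change is normally picked up with "scontrol reconfigure". As far as I know, salloc also accepts --wait-all-nodes=1 to request the same behaviour for a single job, but I have not tested that here.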

Kind regards
Gizo


> Hello, 
> 
> it seems that in a cluster configured for power saving, salloc does not wait until the nodes
> assigned to the job recover from the power-down state and return to normal operation.
> 
> Although the job is in the CONFIGURING state and the nodes are still in IDLE+NOT_RESPONDING+POWERING_UP,
> the nodes are declared ready for the job and srun is invoked, which of course fails because the
> nodes are still booting. (On our cluster, salloc is configured for interactive use: we have
> LaunchParameters=use_interactive_step in slurm.conf.)
> 
> Is this the expected behavior of salloc?
> 
> srun and sbatch work as expected.
> 
> We use Slurm 22.05.3.
> 
> > salloc --nodelist=taurus-n008
> ......
> salloc: Waiting for resource configuration
> salloc: Nodes taurus-n008 are ready for job
> srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted
> salloc: Relinquishing job allocation 766789
> 
> > scontrol show nodes taurus-n008
> ......
> State=IDLE+NOT_RESPONDING+POWERING_UP
> ....
> 
> > scontrol show job 766789
> .....
> JobState=CONFIGURING Reason=None Dependency=(null)
> NodeList=taurus-n008
> 
> Thank you & kind regards
> Gizo
>


