[slurm-users] Slurm Power Saving & salloc
nanava at luis.uni-hannover.de
Tue Oct 25 12:07:48 UTC 2022
Please ignore the question - the option SchedulerParameters=salloc_wait_nodes solves the issue.
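
For reference, a minimal sketch of what the relevant slurm.conf lines might look like. It assumes SchedulerParameters is not already set elsewhere in the file; the LaunchParameters line is the one quoted below.

```
# slurm.conf (excerpt)
# Make salloc wait until all allocated nodes are ready for use (booted)
# before it returns. If SchedulerParameters is already defined, append
# salloc_wait_nodes to the existing comma-separated list instead of
# adding a second SchedulerParameters line.
SchedulerParameters=salloc_wait_nodes
LaunchParameters=use_interactive_step
```

The change is typically picked up after "scontrol reconfigure". For a single job, "salloc --wait-all-nodes=1" requests the same waiting behavior without changing the cluster-wide configuration.
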
> It seems that in a cluster configured for power saving, salloc does not wait until the nodes
> assigned to the job recover from the power-down state and return to normal operation.
> Although the job is in the state CONFIGURING and the nodes are still in IDLE+NOT_RESPONDING+POWERING_UP,
> the nodes are declared ready for the job and srun is invoked (on our cluster, salloc is configured
> for interactive use; we have LaunchParameters=use_interactive_step in slurm.conf),
> which of course fails as the nodes are still booting.
> Is this the expected behavior of salloc?
> Srun and sbatch work as expected.
> We use Slurm 22.05.3
> > salloc --nodelist=taurus-n008
> salloc: Waiting for resource configuration
> salloc: Nodes taurus-n008 are ready for job
> srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
> srun: error: Application launch failed: Communication connection failure
> srun: Job step aborted
> salloc: Relinquishing job allocation 766789
> > scontrol show nodes taurus-n008
> > scontrol show job 766789
> JobState=CONFIGURING Reason=None Dependency=(null)
> Thank you & kind regards