[slurm-users] Slurm Power Saving & salloc
Gizo Nanava
nanava at luis.uni-hannover.de
Mon Oct 24 12:28:52 UTC 2022
Hello,
it seems that in a cluster configured for power saving, salloc does not wait until the nodes
assigned to the job have recovered from the powered-down state and returned to normal operation.
Although the job is still in the CONFIGURING state and the nodes are still in IDLE+NOT_RESPONDING+POWERING_UP,
the nodes are declared ready for the job and srun is invoked, which of course fails because the nodes are still booting.
(On our cluster, salloc is configured for interactive use: we have LaunchParameters=use_interactive_step in slurm.conf.)
Is this the expected behavior of salloc?
srun and sbatch work as expected.
We use Slurm 22.05.3.
> salloc --nodelist=taurus-n008
......
salloc: Waiting for resource configuration
salloc: Nodes taurus-n008 are ready for job
srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted
salloc: Relinquishing job allocation 766789
> scontrol show nodes taurus-n008
......
State=IDLE+NOT_RESPONDING+POWERING_UP
....
> scontrol show job 766789
.....
JobState=CONFIGURING Reason=None Dependency=(null)
NodeList=taurus-n008
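To make the condition concrete, here is a minimal sketch of a check that could be run before launching a step by hand. It only reads the State= field printed by scontrol show node, as in the output above. The function names (parse_node_state, node_state_ready, wait_for_node), the 10-second poll interval, the default timeout, and the choice of which state flags count as "not ready" are all assumptions made for illustration, not Slurm behavior.

```shell
#!/usr/bin/env bash
# Sketch only: helper names, timeout and poll interval are hypothetical.

# Read "scontrol show node" output on stdin and print the value of State=.
parse_node_state() {
    sed -n 's/^[[:space:]]*State=\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Return 0 if the state string carries none of the power-saving or
# not-responding flags. Treating exactly these flags as "not ready"
# is an assumption.
node_state_ready() {
    case "$1" in
        "") return 1 ;;
        *POWERING_UP*|*POWERING_DOWN*|*POWERED_DOWN*|*NOT_RESPONDING*) return 1 ;;
        *) return 0 ;;
    esac
}

# Poll a node until it looks ready or the timeout (in seconds) expires.
# Usage: wait_for_node <nodename> [timeout]
wait_for_node() {
    local node=$1 timeout=${2:-600} waited=0 state
    while [ "$waited" -lt "$timeout" ]; do
        state=$(scontrol show node "$node" | parse_node_state)
        if node_state_ready "$state"; then
            return 0
        fi
        sleep 10
        waited=$((waited + 10))
    done
    return 1
}
```

With these functions sourced into the shell, wait_for_node taurus-n008 returns success only once the node no longer reports POWERING_UP or NOT_RESPONDING, so an srun could be chained after it. This works around the symptom; it does not explain why salloc reports the nodes as ready.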
Thank you & kind regards
Gizo