[slurm-users] Slurm Power Saving & salloc

Gizo Nanava nanava at luis.uni-hannover.de
Mon Oct 24 12:28:52 UTC 2022


Hello, 

it seems that in a cluster configured for power saving, salloc does not wait until the nodes
assigned to the job recover from the power-down state and return to normal operation.

Although the job is in the CONFIGURING state and the nodes are still in IDLE+NOT_RESPONDING+POWERING_UP,
the nodes are declared ready for the job and srun is invoked, which of course fails because the
nodes are still booting. (On our cluster, salloc is configured for interactive use: we have
LaunchParameters=use_interactive_step in slurm.conf.)

Is this the expected behavior of salloc?

srun and sbatch work as expected.

We use Slurm 22.05.3.

> salloc --nodelist=taurus-n008
......
salloc: Waiting for resource configuration
salloc: Nodes taurus-n008 are ready for job
srun: error: Task launch for StepId=766789.interactive failed on node taurus-n008: Communication connection failure
srun: error: Application launch failed: Communication connection failure
srun: Job step aborted
salloc: Relinquishing job allocation 766789

> scontrol show nodes taurus-n008
......
State=IDLE+NOT_RESPONDING+POWERING_UP
....

> scontrol show job 766789
.....
JobState=CONFIGURING Reason=None Dependency=(null)
NodeList=taurus-n008
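
To make the expectation concrete, here is a rough Python sketch of the readiness check we assumed would happen before the interactive step is launched. It is not how salloc works internally: the helper names (node_state_flags, node_is_ready, query_node) are made up for illustration, and the set of flags treated as "not ready" is only our assumption, taken from the node state shown above.

```python
import re
import subprocess

# Assumption: a node carrying any of these flags cannot run a job step yet.
# Both flags are taken from the scontrol output quoted in this post.
NOT_READY_FLAGS = {"NOT_RESPONDING", "POWERING_UP"}


def node_state_flags(scontrol_output):
    """Return the '+'-separated flags of the node's State= field as a set."""
    # \b keeps the pattern from matching inside other keys such as JobState=.
    match = re.search(r"\bState=(\S+)", scontrol_output)
    if match is None:
        raise ValueError("no State= field found in scontrol output")
    return set(match.group(1).split("+"))


def node_is_ready(scontrol_output):
    """True if none of the assumed 'not ready' flags are set on the node."""
    return not (node_state_flags(scontrol_output) & NOT_READY_FLAGS)


def query_node(node):
    """Hypothetical helper: return the text of 'scontrol show node <node>'."""
    result = subprocess.run(
        ["scontrol", "show", "node", node],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With the state reported above (IDLE+NOT_RESPONDING+POWERING_UP), node_is_ready() returns False, so under these assumptions srun should not have been started yet.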

Thank you & kind regards
Gizo


