Hello,
We are enabling global power saving for Slurm; our config is below. However,
we are experiencing an issue where Slurm is taking nodes out of power saving
mode. We are using configless mode and dynamic nodes.
slurmctld logs:
[2025-01-08T09:54:18.096] Cleared POWER_SAVE flag from nodes dev3-cbf-debug-e2std4-s0-[0-1],dev3-cbf-infra-e2std4-s0-[0-1],dev3-cbf-slurmd-e2std4-[0-4]
[2025-01-08T09:54:18.097] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:05:09.683] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:12:00.447] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:17:27.077] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:20:06.735] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:26:16.663] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:26:19.017] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T10:46:25.078] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T15:58:54.080] debug: power_save module disabled, SuspendTime < 0
[2025-01-08T15:58:56.452] debug: power_save module disabled, SuspendTime < 0
[2025-01-09T00:08:01.993] debug: power_save module disabled, SuspendTime < 0
[2025-01-09T00:08:02.157] debug: power_save module disabled, SuspendTime < 0
[2025-01-09T00:14:26.817] debug: power_save module disabled, SuspendTime < 0
[2025-01-09T00:14:26.971] debug: power_save module disabled, SuspendTime < 0
We manage the infrastructure through Puppet, and I can confirm that
SuspendTime=3600 is set in the deployed config.
Slurm version: 24.05.2
# SLURM POWER SAVING FEATURES
SuspendProgram=/etc/slurm/suspend_nodes_slurm.par
ResumeProgram=/etc/slurm/resume_nodes_slurm.par
SuspendTimeout=600
ResumeTimeout=900
ResumeRate=100
SuspendRate=100
SuspendTime=3600
SlurmctldParameters=enable_configless,cloud_dns,idle_on_node_suspend
DebugFlags=Power
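
Since the log reports "SuspendTime < 0" while the file says 3600, one thing worth ruling out is a second SuspendTime setting somewhere in the rendered config (for example on a PartitionName line, or in an included file). The following is only a minimal sketch for that check: the function name find_suspend_time is hypothetical, it assumes the config text has already been read into a string, and it does not follow Include directives or handle an escaped "\#". The output can then be compared by hand against what `scontrol show config` reports for the running controller.

```python
import re

# Matches "SuspendTime=<value>" anywhere on a line. Slurm parameter names are
# case insensitive, so the match is too. "SuspendTimeout=" does not match
# because the "=" must follow "SuspendTime" directly (optional spaces aside).
SUSPEND_RE = re.compile(r"(?<![A-Za-z_])SuspendTime\s*=\s*(\S+)", re.IGNORECASE)


def find_suspend_time(conf_text):
    """Return (line_number, value) for every uncommented SuspendTime setting.

    Hypothetical helper for illustration; not part of Slurm.
    """
    hits = []
    for lineno, line in enumerate(conf_text.splitlines(), start=1):
        active = line.split("#", 1)[0]  # ignore everything after a '#'
        for match in SUSPEND_RE.finditer(active):
            hits.append((lineno, match.group(1)))
    return hits


if __name__ == "__main__":
    # Fragment taken from the config in this post.
    sample = "SuspendTimeout=600\nSuspendTime=3600\n"
    print(find_suspend_time(sample))
```

If more than one entry comes back, or any value is -1 or INFINITE, that line would be the first place to look.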