[slurm-users] Slurm powersave

Davide DelVento davide.quantum at gmail.com
Wed Oct 4 21:03:11 UTC 2023


I'm experimenting with slurm powersave and I have several questions. I'm
following the guidance from https://slurm.schedmd.com/power_save.html and
the great presentation from our own
https://slurm.schedmd.com/SLUG23/DTU-SLUG23.pdf

I am running slurm 23.02.3

1) I'm not sure I fully understand ReconfigFlags=KeepPowerSaveSettings.
The documentation says that if set, an "scontrol reconfig" command will
preserve the current state of SuspendExcNodes, SuspendExcParts and
SuspendExcStates. Why would one *NOT* want to preserve that? What would
happen if one does not (or does) have this setting? For now I'm using it,
assuming it means "if I run scontrol reconfig, don't shut off nodes that
are up because I said in slurm.conf, via those three options, that they
should stay up" --- but I am not sure that is really what it means.
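
To make my mental model concrete, here is the scenario I have in mind (assuming I am right that these values can now be changed at runtime; I have not verified the exact scontrol syntax, so take the commands below as a sketch):

# slurm.conf says:
#   SuspendExcNodes=node[13-32]:2
# someone later changes the excluded set on the fly, e.g.:
scontrol update SuspendExcNodes=node[13-32]:4
# and then a reconfigure happens:
scontrol reconfig
# With ReconfigFlags=KeepPowerSaveSettings I would expect the runtime ":4"
# to survive the reconfig; without the flag I would expect the ":2" from
# slurm.conf to be restored. Is that the intended meaning?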

2) The PDF above says that the problem with nodes in the down and drained
states is solved in 23.02, but that does not appear to be the case. Before
running my experiment, I had

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
ECC memory errors    root      2023-08-26T07:21:04 node27

and after the experiment it became

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
none                 Unknown   Unknown             node27

And that is despite having excluded drained nodes, as shown in the diff below:

--- a/slurm/slurm.conf
+++ b/slurm/slurm.conf
@@ -140,12 +140,15 @@ SlurmdLogFile=/var/log/slurm/slurmd.log
 #
 #
 # POWER SAVE SUPPORT FOR IDLE NODES (optional)
+SuspendProgram=/opt/slurm/poweroff
+ResumeProgram=/opt/slurm/poweron
+SuspendTimeout=120
+ResumeTimeout=240
 #ResumeRate=
+SuspendExcNodes=node[13-32]:2
+SuspendExcStates=down,drain,fail,maint,not_responding,reserved
+BatchStartTimeout=60
+ReconfigFlags=KeepPowerSaveSettings # not sure if needed: preserve current status when running "scontrol reconfig"
-PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196
+PartitionName=compute512 Default=False Nodes=node[13-32] State=UP DefMemPerCPU=9196 SuspendTime=600

So perhaps that is not solved after all? Anyway, it's a nuisance, not a deal breaker.

3) The whole thing does not appear to be working as I intended. My
understanding of the SuspendExcNodes setting above is that Slurm should
never shut off more than the idle nodes in that partition minus 2, i.e. it
should always keep at least two of them up. Instead it shut all of them
off, and then tried to turn them back on:

$ sinfo | grep 512
compute512     up   infinite      1 alloc# node15
compute512     up   infinite      2  idle# node[14,32]
compute512     up   infinite      3  down~ node[16-17,31]
compute512     up   infinite      1 drain~ node27
compute512     up   infinite     12  idle~ node[18-26,28-30]
compute512     up   infinite      1  alloc node13

But again, this is a minor nuisance which I can live with (especially if it
happens only when I "flip the switch"), and I'm mentioning it only in case
it's a symptom of something else I'm doing wrong. I did try both the
SuspendExcNodes=node[13-32]:2 syntax, which seems more consistent to me
with the rest of the file (e.g. the partition definitions), and the
SuspendExcNodes=node[13\-32]:2 syntax suggested in the Slurm power saving
documentation. The behavior was exactly identical.
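
Spelled out, the two variants of that line I tried were:

SuspendExcNodes=node[13-32]:2
SuspendExcNodes=node[13\-32]:2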

4) Most importantly, in the output above you may have noticed two nodes
(actually three by the time I ran the command below) that Slurm deemed down:

$ sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       root      2023-09-13T13:14:50 node31
reboot timed out     slurm     2023-10-04T14:51:28 node14
reboot timed out     slurm     2023-10-04T14:52:28 node15
reboot timed out     slurm     2023-10-04T14:49:58 node32
none                 Unknown   Unknown             node27

This can't be right: the nodes are fine and cannot have timed out while
"rebooting", because for now my poweroff and poweron scripts are identical,
literally a simple one-liner bash script that does almost nothing, and the
log file is populated correctly, as I would expect:

echo "Pretending to $0 the following node(s): $1"  >> $log_file 2>&1

So I can confirm Slurm invoked the script, but it then waited for something
(what? slurmd starting up again?) which never happened, and marked the
nodes down. When I removed the SuspendTime from the partition to end the
experiment, the other nodes went "magically" back into production, without
Slurm calling my poweron script. Of course the nodes were never actually
powered off, but Slurm thought they were, so why did it not have the same
problem with them as it did with the nodes it intentionally tried to power
on?

Thanks for any light you can shed on these issues, particularly the last
one!