[slurm-users] error: power_save module disabled, NULL SuspendProgram

Wed Mar 29 13:51:51 UTC 2023

Hi Thomas,

Slurm power_save has not been problematic for us with bare-metal 
on-premise nodes, in contrast to the problems you're having.

We use power_save with on-premise nodes where we control the power down/up 
by means of IPMI commands as provided in the scripts which you will find 
in https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
There's no hocus-pocus once the IPMI commands are working correctly with 
your nodes.
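
To illustrate the idea (this is not the code from the repository above, just a minimal sketch), a ResumeProgram could look roughly like the following.  Slurm passes the nodes to resume as a single hostlist argument.  The BMC naming scheme, the IPMI user name and the password file path are assumptions made up for this example and must be adapted to your site.

```python
#!/usr/bin/env python3
"""Minimal sketch of a ResumeProgram that powers nodes on via IPMI."""
import subprocess
import sys

# Assumptions for illustration only; adjust to your site.
BMC_SUFFIX = "-bmc"                              # hypothetical: BMC of node001 is node001-bmc
IPMI_USER = "admin"                              # hypothetical IPMI user name
IPMI_PASSWORD_FILE = "/etc/slurm/ipmi_password"  # hypothetical, readable by user "slurm"
ACTION = "on"                                    # a suspend script would use "off" or "soft"


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist expression such as node[001-003] into node names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def build_ipmi_command(node, action):
    """Return the ipmitool command line that applies a power action to one node."""
    if action not in ("on", "off", "soft", "status"):
        raise ValueError(f"unsupported power action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", node + BMC_SUFFIX,
        "-U", IPMI_USER,
        "-f", IPMI_PASSWORD_FILE,
        "chassis", "power", action,
    ]


def main(argv):
    # slurmctld calls the program with one argument: the hostlist to act on.
    failed = 0
    for node in expand_hostlist(argv[1]):
        rc = subprocess.run(build_ipmi_command(node, ACTION)).returncode
        if rc != 0:
            print(f"{node}: ipmitool exited with {rc}", file=sys.stderr)
            failed += 1
    return 1 if failed else 0


# Only act when called with exactly one hostlist argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

Keeping the command construction in its own function makes it easy to print and check the ipmitool command line by hand before letting slurmctld run anything.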

Of course, our slurmctld server can communicate with our IPMI management 
network to perform power management.  I don't see this network access as a 
security problem.

I think we also had power_save with IPMI working in Slurm 21.08 before we 
upgraded to 22.05.

As for job scheduling, slurmctld may allocate a job to some powered-off 
nodes and then call the ResumeProgram defined in slurm.conf.  From this 
point it may indeed take 2-3 minutes before a node is up and running 
slurmd, during which time it will have a state of POWERING_UP (see "man 
sinfo").  If this doesn't complete within ResumeTimeout seconds, the node 
will be marked DOWN.  All this logic seems to be working well.
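
For orientation, the slurm.conf settings involved in this sequence look roughly like the fragment below.  The parameter names are standard slurm.conf options, but the script paths, the node name and all numeric values are placeholders for illustration, not our production settings.  The error in the subject line suggests that SuspendProgram was not set at all.

```
# Power saving section of slurm.conf (illustrative values only)
SuspendProgram=/usr/local/bin/nodesuspend   # hypothetical path
ResumeProgram=/usr/local/bin/noderesume     # hypothetical path
SuspendTime=1800         # idle seconds before a node is powered down
SuspendTimeout=120       # seconds allowed for a node to power down
ResumeTimeout=600        # seconds allowed for a node to boot and start slurmd
SuspendExcNodes=node001  # hypothetical: nodes that must never be powered down
```

ResumeTimeout should comfortably exceed the 2-3 minutes of boot time mentioned above, otherwise healthy nodes may be marked DOWN while they are still booting.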

If you would like to try out the above-mentioned IPMI scripts, you could 
test them by hand on your IPMI network to see if you can reliably power 
some nodes up and down.  If this works, you should be able to configure 
slurmctld so that it executes the scripts (note: they will be run by the 
"slurm" user).

Best regards,
Ole

On 3/29/23 14:16, Dr. Thomas Orgis wrote:
> On Mon, 27 Mar 2023 13:17:01 +0200,
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
> 
>> FYI: Slurm power_save works very well for us without the issues that you
>> describe below.  We run Slurm 22.05.8, what's your version?
> 
> I'm sure that there are setups where it works nicely;-) For us, it
> didn't, and I was faced with hunting the bug in slurm or working around
> it with more control, fixing the underlying issue of the node resume
> script being called _after_ the job has been allocated to the node.
> That is too late in case of node bootup failure and causes annoying
> delays for users only to see jobs fail.
> 
> We do run 21.08.8-2, which means any debugging of this on the slurm
> side would mean upgrading first (we don't upgrade just for upgrade's
> sake). And, as I said: The issue of the wrong timing remains, unless I
> try deeper changes in slurm's logic. The other issue is that we had a
> kludge in place, anyway, to enable slurmctld to power on nodes via
> IPMI. The machine slurmctld runs on has no access to the IPMI network
> itself, so we had to build a polling communication channel to the node
> which has this access (and which is on another security layer, hence no
> ssh into it). For all I know, this communication kludge is not to
> blame, as, in the spurious failures, the nodes did boot up just fine
> and were ready. Only slurmctld decided to let the timeout pass first,
> then recognize that the slurmd on the node is there, right that instant.
> 
> Did your power up/down script workflow work with earlier slurm
> versions, too? Did you use it on bare metal servers or mostly on cloud
> instances?
> 
> Do you see a chance for
> 
> a) fixing up the internal powersaving logic to properly allocate
>     nodes to a job only when these nodes are actually present (ideally,
>     with a health check passing) or
> b) designing an interface between slurm as manager of available
>     resources and another site-specific service responsible for off-/onlining
>     resources that are known to slurm, but down/drained?
> 
> My view is that Slurm's task is to distribute resources among users.
> The cluster manager (person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides
> if a node is currently available to Slurm or down for maintenance, for
> example. Power saving would be another reason for a node being taken
> out of service.
> 
> Maybe I got an old-fashioned minority view …
> 
> 
> Alrighty then,
> 
> Thomas
> 
> PS: I guess solution a) above goes against Slurm's focus on throughput
> and avoiding delays caused by synchronization points, while our view here
> is that batch jobs where that matters should be written differently,
> packing more than a few seconds' worth of work into each step.
>