[slurm-users] error: power_save module disabled, NULL SuspendProgram

Wed Mar 29 12:42:33 UTC 2023

I'd be interested in your kludge. We face a similar situation: the
slurmctld node does not have access to the IPMI network and cannot ssh
to the machines that do. We are thinking of creating a REST interface to
a control server that would run the IPMI commands.
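
A minimal sketch of what such a control server could look like, using only
the Python standard library. Everything site-specific here is an
assumption for illustration: the URL layout /power/<node>/<action>, the
"<node>-ipmi" BMC naming scheme, the "admin" user, and the port. ipmitool's
-E option takes the password from the IPMI_PASSWORD environment variable.
Authentication and TLS are deliberately left out and would be needed in
practice.

```python
import re
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical naming scheme: compute node "node042" has its BMC at "node042-ipmi".
NODE_RE = re.compile(r"[a-z][a-z0-9-]{0,62}")
ALLOWED_ACTIONS = {"on", "off", "status"}


def ipmi_command(node: str, action: str, user: str = "admin") -> list[str]:
    """Build the ipmitool argv for one whitelisted power action."""
    if not NODE_RE.fullmatch(node):
        raise ValueError(f"invalid node name: {node!r}")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    # An argv list (no shell) keeps request data out of any shell parsing.
    return ["ipmitool", "-I", "lanplus", "-H", f"{node}-ipmi",
            "-U", user, "-E", "chassis", "power", action]


class PowerHandler(BaseHTTPRequestHandler):
    """Handles POST /power/<node>/<action> (assumed URL layout)."""

    def do_POST(self):
        parts = self.path.strip("/").split("/")
        try:
            if len(parts) != 3 or parts[0] != "power":
                raise ValueError("expected /power/<node>/<action>")
            argv = ipmi_command(parts[1], parts[2])
        except ValueError as exc:
            self.send_error(400, str(exc))
            return
        try:
            result = subprocess.run(argv, capture_output=True,
                                    text=True, timeout=30)
        except (OSError, subprocess.TimeoutExpired) as exc:
            self.send_error(502, f"ipmitool failed: {type(exc).__name__}")
            return
        self.send_response(200 if result.returncode == 0 else 502)
        self.end_headers()
        self.wfile.write((result.stdout or result.stderr).encode())


def make_server(host: str = "0.0.0.0", port: int = 8080) -> HTTPServer:
    """Create the server; call .serve_forever() on the result to run it."""
    return HTTPServer((host, port), PowerHandler)
```

On the control server one would run make_server().serve_forever(); the
ResumeProgram on the slurmctld side would then issue one POST per node.
Restricting requests to a fixed set of actions and a strict node-name
pattern keeps the interface from becoming a general remote shell.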

Ben

On 29-03-2023 14:16, Dr. Thomas Orgis wrote:
> On Mon, 27 Mar 2023 13:17:01 +0200,
> Ole Holm Nielsen <Ole.H.Nielsen at fysik.dtu.dk> wrote:
>
>> FYI: Slurm power_save works very well for us without the issues that you
>> describe below.  We run Slurm 22.05.8, what's your version?
> I'm sure that there are setups where it works nicely;-) For us, it
> didn't, and I was faced with hunting the bug in slurm or working around
> it with more control, fixing the underlying issue of the node resume
> script being called _after_ the job has been allocated to the node.
> That is too late in case of node bootup failure and causes annoying
> delays for users only to see jobs fail.
>
> We do run 21.08.8-2, which means any debugging of this on the slurm
> side would mean upgrading first (we don't upgrade just for upgrade's
> sake). And, as I said: The issue of the wrong timing remains, unless I
> try deeper changes in slurm's logic. The other issue is that we had a
> kludge in place, anyway, to enable slurmctld to power on nodes via
> IPMI. The machine slurmctld runs on has no access to the IPMI network
> itself, so we had to build a polling communication channel to the node
> which has this access (and which is on another security layer, hence no
> ssh into it). For all I know, this communication kludge is not to
> blame, as, in the spurious failures, the nodes did boot up just fine
> and were ready. Only slurmctld decided to let the timeout pass first,
> then recognize that the slurmd on the node is there, right that instant.
>
> Did your power up/down script workflow work with earlier slurm
> versions, too? Did you use it on bare metal servers or mostly on cloud
> instances?
>
> Do you see a chance for
>
> a) fixing up the internal powersaving logic to properly allocate
>     nodes to a job only when these nodes are actually present (ideally,
>     with a health check passing) or
> b) designing an interface between slurm as manager of available
>     resources and another site-specific service responsible for off-/onlining
>     resources that are known to slurm, but down/drained?
>
> My view is that Slurm's task is to distribute resources among users.
> The cluster manager (person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides
> if a node is currently available to Slurm or down for maintenance, for
> example. Power saving would be another reason for a node being taken
> out of service.
>
> Maybe I got an old-fashioned minority view …
>
>
> Alrighty then,
>
> Thomas
>
> PS: I guess solution a) above goes against Slurm's focus on throughput
> and avoiding delays caused by synchronization points, while our idea here
> is that batch jobs where that matters should be written differently,
> packing more than a few seconds worth of work into each step.
>
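
For reference, a polling channel of the kind described in the quoted
message could be sketched as a simple file spool. This is not Thomas's
actual implementation, which the thread does not show. The function names,
the JSON request format, and the spool directory are all hypothetical, and
how the poller on the IPMI-capable host reaches the spool (shared
filesystem, periodic HTTPS fetch, ...) is left open.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def enqueue(spool: Path, node: str, action: str) -> Path:
    """slurmctld side (e.g. called from ResumeProgram): drop one request file."""
    request = {"node": node, "action": action, "time": time.time()}
    # Write to a hidden temp file first, then rename into place.
    fd, tmp = tempfile.mkstemp(dir=spool, prefix=".tmp-")
    with os.fdopen(fd, "w") as handle:
        json.dump(request, handle)
    final = spool / f"{time.time_ns()}-{node}.json"
    os.replace(tmp, final)  # atomic rename: a poller never sees a partial file
    return final


def drain(spool: Path) -> list[dict]:
    """Poller side: return pending requests in filename order and remove them."""
    requests = []
    for path in sorted(spool.glob("*.json")):
        requests.append(json.loads(path.read_text()))
        path.unlink()
    return requests
```

The poller would call drain() on a timer and run the IPMI command for each
request. Since the host with IPMI access only ever pulls, nothing needs to
connect into the more protected security layer.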

-- 
---------------------------------------------------------------------
Dr. B.J.W. Polman, C&CZ, Radboud University Nijmegen.
Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360
e-mail: Ben.Polman at science.ru.nl
