[slurm-users] error: power_save module disabled, NULL SuspendProgram

Mon Mar 27 10:57:59 UTC 2023

On Mon, 06 Mar 2023 13:35:38 +0100,
Stefan Staeglich <staeglis at informatik.uni-freiburg.de> wrote:

> But this fixed not the main error but might have reduced the frequency of 
> occurring. Has someone observed similar issues? We will try a higher 
> SuspendTimeout.

We had issues with power saving. We powered the idle nodes off, so
resuming them requires a full boot. We repeatedly observed the strange
behaviour that a node is up for a while, but slurmctld only detects it
as ready right when it is about to give up at ResumeTimeout.

But instead of fixing this possibly subtle logic error, we figured that

a) The node suspend support in Slurm was not really designed for full
   power off/on, which can regularly take minutes.

b) This functionality of taking nodes out of/into production is
   something the cluster admin does. This is not in the scope of the
   batch system.

Hence I wrote a script that runs as a service on a superior admin node.
It queries Slurm for idle nodes and pending jobs and then decides which
nodes to drain and then power down or bring back online.
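To illustrate the querying part, here is a minimal sketch of how such a
service could take a snapshot of idle nodes and pending demand. It is
not the actual script. It assumes that `sinfo -h -N -t idle -o %N`
prints one node name per line and that `squeue -h -t PD -o %D` prints
one requested node count per pending job; the function names are made
up for this example, and details such as job arrays or how drained
nodes interact with the state filter are ignored.

```python
import subprocess


def parse_idle_nodes(sinfo_output: str) -> list[str]:
    """Parse one-node-per-line output into a sorted, deduplicated list."""
    # With -N, a node that sits in several partitions is printed once per
    # partition, so collect the names in a set first.
    names = {line.strip() for line in sinfo_output.splitlines() if line.strip()}
    return sorted(names)


def parse_pending_node_demand(squeue_output: str) -> int:
    """Sum the per-job node counts of pending jobs."""
    total = 0
    for line in squeue_output.splitlines():
        field = line.strip()
        # Skip empty lines and anything that is not a plain integer.
        if field.isdigit():
            total += int(field)
    return total


def run(argv: list[str]) -> str:
    """Run a command and return its stdout, raising on a non-zero exit."""
    return subprocess.run(argv, check=True, capture_output=True, text=True).stdout


def snapshot() -> tuple[list[str], int]:
    """Ask Slurm for the idle nodes and the node demand of pending jobs."""
    idle = parse_idle_nodes(run(["sinfo", "-h", "-N", "-t", "idle", "-o", "%N"]))
    demand = parse_pending_node_demand(run(["squeue", "-h", "-t", "PD", "-o", "%D"]))
    return idle, demand
```

Keeping the parsing separate from the subprocess calls means the logic
can be tested on an admin node without a running slurmctld.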

This needs more knowledge of Slurm job and node states than I'd like,
but it works. Ideally, I'd like the power saving feature of Slurm to
consist of a simple interface that can communicate

1. which nodes are probably not needed in the coming x minutes/hours,
   depending on the job queue, with settings like keeping a minimum number
   of nodes idle, and
2. which nodes that are currently drained/offline it could use to satisfy
   user demand.
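The two answers such an interface would give can be sketched as one
pure function. This is only an illustration under strong assumptions:
`plan`, `Plan`, `pending_nodes` and `min_idle` are hypothetical names,
demand is counted in whole nodes, and the sketch ignores that jobs may
be pending for reasons other than a lack of nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    power_down: list[str]  # idle nodes that are probably not needed
    bring_up: list[str]    # offline nodes that could satisfy demand


def plan(idle: list[str], offline: list[str],
         pending_nodes: int, min_idle: int) -> Plan:
    """Decide which nodes to power down or bring back online."""
    # Nodes we want available: the pending demand plus a fixed idle reserve.
    wanted = pending_nodes + min_idle
    if len(idle) > wanted:
        # More idle nodes than wanted: the surplus can be powered down.
        return Plan(power_down=idle[wanted:], bring_up=[])
    # Not enough idle nodes: wake as many offline nodes as are missing.
    shortfall = wanted - len(idle)
    return Plan(power_down=[], bring_up=offline[:shortfall])
```

For example, with four idle nodes, one pending single-node job and
`min_idle=1`, the last two idle nodes end up in `power_down`.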

I imagine that Slurm upstream is not very keen on hashing out a robust
interface for that. I can see arguments for keeping this wholly
internal to Slurm, but for me, taking nodes in/out of production is not
directly a batch system's task. Obviously, integrating power saving
that really powers nodes down brings complications like the strange
ResumeTimeout behaviour. Also, in the case of nodes that have trouble
getting back online, the method inside Slurm provides for a bad user
experience:

The nodes are first allocated to the job, and _then_ they are powered
up. In the worst case of a defective node, Slurm will wait for the
whole ResumeTimeout just to realize that it doesn't really have the
resources it just promised to the job, making the job run attempt fail
needlessly.

With my external approach, the handling of bringing a node back up is
done outside slurmctld. Only after a node is back is it undrained, and
only then are jobs allocated to it. I drain with a specific reason to
mark nodes that are offline due to power saving. What sucks is that
I have to implement part of the scheduler in the sense that I need to
match pending jobs' demands against properties of available nodes.
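A rough sketch of the drain marker and the matching step, not the
actual script: the reason string "powersave", all function and class
names, and the choice of node properties are assumptions for this
example. The `scontrol update NodeName=... State=DRAIN Reason=...` and
`State=RESUME` forms are the standard way to drain and undrain a node;
the parser assumes output from `sinfo -h -N -o '%N|%E'`.

```python
from dataclasses import dataclass

POWERSAVE_REASON = "powersave"  # hypothetical marker, not a Slurm keyword


def drain_command(node: str) -> list[str]:
    """Command to drain a node and tag it as offline for power saving."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={POWERSAVE_REASON}"]


def undrain_command(node: str) -> list[str]:
    """Command to return a node to service once it is confirmed up."""
    return ["scontrol", "update", f"NodeName={node}", "State=RESUME"]


def powersaved_nodes(sinfo_output: str) -> list[str]:
    """Keep only nodes whose drain reason is our own marker."""
    nodes = set()
    for line in sinfo_output.splitlines():
        name, sep, reason = line.partition("|")
        # Nodes drained for other reasons (hardware faults etc.) are skipped.
        if sep and reason.strip() == POWERSAVE_REASON:
            nodes.add(name.strip())
    return sorted(nodes)


@dataclass(frozen=True)
class NodeSpec:
    cpus: int
    mem_mb: int
    features: frozenset[str]


@dataclass(frozen=True)
class JobDemand:
    cpus_per_node: int
    mem_mb_per_node: int
    features: frozenset[str]


def node_fits(job: JobDemand, node: NodeSpec) -> bool:
    """Very reduced version of the matching the scheduler does itself."""
    return (node.cpus >= job.cpus_per_node
            and node.mem_mb >= job.mem_mb_per_node
            and job.features <= node.features)
```

Filtering on the reason string keeps the service from touching nodes
that an admin drained by hand for some other reason.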

Maybe the internal powersaving could be made more robust, but I would
rather like to see more separation of concerns than putting everything
into one box. Things are too entangled as it is: my simple concept of a
'job' does not even begin to describe what Slurm has in terms of various
steps as scheduling entities, which by default also use delayed
allocation techniques (regarding prolog script behaviour, for example).

Alrighty then,

Thomas

-- 
Dr. Thomas Orgis
HPC @ Universität Hamburg