[slurm-users] error: power_save module disabled, NULL SuspendProgram
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Mar 27 11:17:01 UTC 2023
Hi Thomas,
FYI: Slurm power_save works very well for us, without the issues that you
describe below. We run Slurm 22.05.8; what's your version?
I've documented our setup in this Wiki page:
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
This page contains a link to power_save scripts on GitHub.
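As far as I know, the error in the subject line is what slurmctld logs when
no SuspendProgram is configured at all. As a rough sketch (the program paths
and values below are placeholders, not our actual configuration), the
power_save-related part of slurm.conf looks something like this:

   # Power saving in slurm.conf (illustrative values only)
   SuspendProgram=/usr/local/bin/node_suspend.sh
   ResumeProgram=/usr/local/bin/node_resume.sh
   SuspendTime=600            # idle seconds before a node is suspended
   SuspendTimeout=120         # seconds allowed for the suspend to complete
   ResumeTimeout=600          # seconds allowed for a node to boot and register
   SuspendExcNodes=login01,login02   # never power these down
   SuspendExcParts=infra             # nor nodes in these partitions
   SuspendRate=10             # nodes suspended per minute
   ResumeRate=10              # nodes resumed per minute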
IHTH,
Ole
On 3/27/23 12:57, Dr. Thomas Orgis wrote:
> On Mon, 06 Mar 2023 13:35:38 +0100,
> Stefan Staeglich <staeglis at informatik.uni-freiburg.de> wrote:
>
>> But this did not fix the main error; it may only have reduced how often
>> it occurs. Has anyone observed similar issues? We will try a higher
>> SuspendTimeout.
>
> We had issues with power saving as well. We powered the idle nodes off
> completely, so resuming requires a full boot. We repeatedly observed the
> strange behaviour that a node is up for a while, but slurmctld only
> detects it as ready right when it gives up at SuspendTimeout.
>
> But instead of fixing this possibly subtle logic error, we figured that
>
> a) The node suspend support in Slurm was not really designed for a full
> power off/on cycle, which regularly takes minutes.
>
> b) Taking nodes out of and back into production is something the
> cluster admin does; it is not in the scope of the batch system.
>
> Hence I wrote a script that runs as a service on a separate admin node.
> It queries Slurm for idle nodes and pending jobs and decides which nodes
> to drain and power down, or which to bring back online.
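>
> A rough sketch of what such a loop might look like (Python; the thresholds,
> the "<node>-bmc" naming and the ipmitool calls are illustrative placeholders,
> not the actual script):
>
>     import subprocess
>     import time
>
>     MIN_IDLE = 2                # keep at least this many idle nodes powered on
>     DRAIN_REASON = "powersave"  # marker: only touch nodes we drained ourselves
>
>     def run(cmd):
>         return subprocess.run(cmd, capture_output=True, text=True,
>                               check=True).stdout
>
>     def idle_nodes():
>         # Nodes currently idle, one name per line
>         return run(["sinfo", "-h", "-N", "-t", "idle", "-o", "%N"]).split()
>
>     def drained_by_us():
>         # Drained nodes whose drain reason carries our power-save marker
>         out = run(["sinfo", "-h", "-N", "-t", "drained", "-o", "%N %E"])
>         return [line.split()[0] for line in out.splitlines()
>                 if DRAIN_REASON in line]
>
>     def pending_node_demand():
>         # Crude demand estimate: node counts requested by pending jobs.
>         # (The real script also matches job constraints against node features.)
>         out = run(["squeue", "-h", "-t", "PD", "-o", "%D"])
>         return sum(int(n) for n in out.split())
>
>     def power_down(node):
>         run(["scontrol", "update", f"NodeName={node}",
>              "State=DRAIN", f"Reason={DRAIN_REASON}"])
>         subprocess.run(["ipmitool", "-H", f"{node}-bmc",
>                         "chassis", "power", "off"])
>
>     def power_up(node):
>         subprocess.run(["ipmitool", "-H", f"{node}-bmc",
>                         "chassis", "power", "on"])
>         # The real setup waits until the node has booted and answers again
>         # before undraining; that health check is omitted in this sketch.
>         run(["scontrol", "update", f"NodeName={node}", "State=RESUME"])
>
>     while True:
>         idle = idle_nodes()
>         surplus = len(idle) - MIN_IDLE - pending_node_demand()
>         if surplus > 0:
>             for node in idle[:surplus]:
>                 power_down(node)
>         elif surplus < 0:
>             need = -surplus
>             for node in drained_by_us()[:need]:
>                 power_up(node)
>         time.sleep(300)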
>
> This needs more knowledge of Slurm job and node states than I'd like,
> but it works. Ideally, I'd like Slurm's power-saving feature to consist
> of a simple interface that can communicate
>
> 1. which nodes are probably not needed in the coming x minutes/hours,
> depending on the job queue, with settings like keeping a minimum number
> of nodes idle, and
> 2. which currently drained/offline nodes it could use to satisfy user
> demand (a hypothetical sketch of such an interface follows below).
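>
> A purely hypothetical shape for such an interface, just to illustrate the
> split of responsibilities (this is not anything Slurm offers today):
>
>     def nodes_unneeded(horizon_minutes: int, keep_idle: int) -> list[str]:
>         """Scheduler side: nodes it probably won't need within the horizon,
>         while keeping at least keep_idle nodes idle and powered on."""
>         ...
>
>     def nodes_offerable() -> list[str]:
>         """Site side: currently drained/offline nodes that could be powered
>         up to satisfy pending user demand."""
>         ...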
>
> I imagine that Slurm upstream is not very keen on hashing out a robust
> interface for that. I can see arguments for keeping this wholly
> internal to Slurm, but for me, taking nodes in/out of production is not
> directly a batch system's task. Obviously the integration of power
> saving that involves nodes really being powered down brings
> complications like the strange ResumeTimeout behaviour. Also, in the
> case of nodes that have trouble getting back online, the method inside
> Slurm makes for a bad user experience:
>
> The nodes are first allocated to the job, and _then_ they are powered
> up. In the worst case of a defective node, Slurm will wait for the
> whole SuspendTimeout just to realize that it doesn't really have the
> resources it just promised to the job, making the job run attempt fail
> needlessly.
>
> With my external approach, the handling of bringing a node back up is
> done outside slurmctld. Only after a node is back up is it undrained, and
> jobs will be allocated on it. I drain nodes with a specific reason to
> mark those that are offline due to power saving. What sucks is that
> I have to implement part of the scheduler in the sense that I need to
> match pending jobs' demands against properties of available nodes.
>
> Maybe the internal power saving could be made more robust, but I would
> rather see more separation of concerns than putting everything into one
> box. Things are too entangled; even my simple concept of a 'job' does
> not begin to describe what Slurm has in terms of the various steps as
> scheduling entities, which by default also use delayed allocation
> techniques (regarding prolog script behaviour, for example).
>
>
> Alrighty then,
>
> Thomas
>
--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark