[slurm-users] Setting up a reactivity margin with SLURM
ahmet.mercan at uhem.itu.edu.tr
Mon May 23 09:22:56 UTC 2022
For the same reasons you mention, I don't use Slurm's power saving
features. I want to keep a certain number of nodes always powered on and
ready to run. The Slurm settings for this are very limited: only the
SuspendExcNodes and SuspendExcParts parameters exist. But SuspendExcNodes
is totally useless for this purpose. The nodes you list in SuspendExcNodes
are always powered on, so they are probably busy. When these nodes are
busy, there aren't any idle nodes left for jobs to start instantly.
We use a cron script to power idle nodes off and on. It keeps a certain
number of idle nodes always powered on. It also chooses this number
based on a prediction of the cluster load, derived from the history of
the cluster load.
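To make the idea concrete, here is a minimal sketch of the decision step such a cron script could perform. It is not our actual script. The names Node, target_idle and plan, the state labels "idle", "busy" and "off", and the use of a simple average over recent demand samples as the "prediction" are all assumptions made for illustration. Collecting the node states (for example by parsing sinfo output) and actually powering nodes on or off are site-specific and left out.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    state: str  # hypothetical labels: "idle", "busy" or "off"


def target_idle(history: Sequence[int], floor: int = 1) -> int:
    """Choose how many idle nodes to keep powered on.

    `history` holds recent demand samples (nodes requested per interval).
    Assumption: the prediction is just their average, rounded up, and
    never lower than `floor`.
    """
    if not history:
        return floor
    return max(floor, math.ceil(sum(history) / len(history)))


def plan(nodes: Sequence[Node], target: int) -> Tuple[List[str], List[str]]:
    """Return (nodes to power on, nodes to power off) to reach `target`."""
    idle = sorted(n.name for n in nodes if n.state == "idle")
    off = sorted(n.name for n in nodes if n.state == "off")
    deficit = target - len(idle)
    if deficit > 0:
        # Too few idle nodes: wake as many powered-off nodes as needed.
        return off[:deficit], []
    if deficit < 0:
        # Too many idle nodes: power off the surplus.
        return [], idle[:-deficit]
    return [], []


if __name__ == "__main__":
    cluster = [Node("n1", "idle"), Node("n2", "busy"),
               Node("n3", "off"), Node("n4", "off")]
    wanted = target_idle([1, 2, 4])
    to_start, to_stop = plan(cluster, wanted)
    print("target idle:", wanted, "start:", to_start, "stop:", to_stop)
```

Busy nodes are never touched, so running jobs are unaffected. Only the idle pool grows or shrinks around the target.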
But there is a problem with this approach: Slurm and the users cannot
tell which nodes are down for power saving and which are down for other
reasons. To solve this issue, my spart command
(https://github.com/mercanca/spart), which is used to show queues (free
CPU and node info), has a feature that shows power-saving nodes as idle.
The cron script was not written for publication; it is very specific to
our environment. But if you want to use it (or just get inspired by it),
I can share it.
On 23.05.2022 11:03, Corentin Mercier wrote:
> I am currently trying to make energy savings on a cluster running SLURM.
> I read the Power Saving guide and I found exactly what I am looking
> for: SuspendTime. It allows me to shut nodes down after a certain
> idle time.
> However, I want to go further by keeping a small number of nodes idle
> in certain partitions in order to allow small jobs to run instantly.
> In short, I want to keep a reactivity margin on certain partitions.
> In the documentation, I saw that it's possible to exclude given nodes
> from shutting down but I want that list to be dynamic and to keep a
> certain amount of nodes idle.
> Here's an example :
> On partition A, there should always be 5 idle nodes available to new
> clients. As clients come, those idle nodes become allocated and new
> nodes need to be started in order to replace them (they'll stay idle
> until allocated).
> I would need to wake some nodes and update the exclusion list so
> they stay idle.
> As someone else could have faced the same issue, I went on SLURM's
> GitHub to check the available plugins there but I couldn't find any
> that implement a dynamic reactivity margin.
> So, is there a plugin that implements such a mechanism? Or should I
> work with the Suspend/ResumeProgram scripts to update the
> SuspendExcNodes list by hand?
> I'd be glad to hear any other existing solution too.