[slurm-users] [External] Hibernating a whole cluster

Analabha Roy hariseldon99 at gmail.com
Tue Feb 7 10:05:32 UTC 2023


Hi,

Thanks. I had read the Slurm Power Saving Guide before. I believe the
configs enable slurmctld to check other nodes for idleness and
suspend/resume them. Slurmctld must run on a separate, always-on server for
this to work, right?

My issue might be a little different. I literally have only one node that
runs everything: slurmctld, slurmd, slurmdbd, everything.

This node must be hibernated with "sudo systemctl hibernate" after business hours,
regardless of whether jobs are queued or running. The next business day, it
can be switched on manually.

systemctl hibernate is supposed to save the entire run state of the sole
node to swap and then power off. When powered on again, it should restore
everything to its previous running state.
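
Since hibernation writes the memory image to swap, a quick sanity check on
swap size is a sensible first step. The following is a minimal sketch: the
function name swap_fits_ram is made up for illustration, and requiring swap
to be at least as large as RAM is a conservative assumption, not a rule
taken from the systemd documentation.

```shell
#!/bin/sh
# Compare SwapTotal against MemTotal (both reported in kB) from a
# meminfo-style file. Prints "ok" if swap is at least as large as RAM,
# "too-small" otherwise. The function name is hypothetical.
swap_fits_ram() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"  { mem = $2 }
        $1 == "SwapTotal:" { swap = $2 }
        END {
            # "+ 0" forces a numeric comparison
            if (swap + 0 >= mem + 0) print "ok"; else print "too-small"
        }
    ' "$meminfo"
}
```

Calling swap_fits_ram with no argument reads /proc/meminfo on the node
itself; passing a file path checks a saved copy instead.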

When the job queue is empty, this works well. I'm not sure how well this
hibernate/resume will work with running jobs and would appreciate any
suggestions or insights.
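
One option I am considering is a systemd sleep hook that suspends running
jobs before hibernation and resumes them afterwards. Below is a rough,
untested-on-real-jobs sketch. The hook path and the function names are
assumptions for illustration. It relies on systemd-sleep calling executables
in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument
and the sleep type as the second, and on "scontrol suspend" and
"scontrol resume" taking a job ID.

```shell
#!/bin/sh
# Hypothetical hook, e.g. /usr/lib/systemd/system-sleep/slurm-jobs
# systemd-sleep runs it as root with:
#   $1 = pre | post
#   $2 = suspend | hibernate | hybrid-sleep | suspend-then-hibernate

# Suspend every job that squeue currently reports as RUNNING.
suspend_running_jobs() {
    for job in $(squeue -h -t RUNNING -o %A); do
        scontrol suspend "$job"
    done
}

# Resume every job that squeue currently reports as SUSPENDED.
# Caveat: this also resumes jobs that were suspended for other reasons.
resume_suspended_jobs() {
    for job in $(squeue -h -t SUSPENDED -o %A); do
        scontrol resume "$job"
    done
}

# Act only on hibernation; any other call is a no-op.
case "${1:-}/${2:-}" in
    pre/hibernate)  suspend_running_jobs ;;
    post/hibernate) resume_suspended_jobs ;;
esac
```

Whether slurmctld and slurmd are responsive enough right after resume for
the "post" step to succeed is something I have not verified. If a job's
time limit still needs adjusting afterwards, scontrol appears to accept a
relative value such as "scontrol update JobId=<id> TimeLimit=+30".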

AR


On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzillner at lenovo.com> wrote:

> Hi,
>
> follow this guide: https://slurm.schedmd.com/power_save.html
>
> Create poweroff / poweron scripts and configure slurm to do the poweroff
> after X minutes. Works well for us. Make sure to set an appropriate time
> (ResumeTimeout) to allow the node to come back to service.
> Note that we did not achieve good power saving with suspending the nodes,
> powering them off and on saves way more power. The downside is it takes ~ 5
> mins to resume (= power on) the nodes when needed.
>
> Cheers,
> Florian
> ------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Analabha Roy <hariseldon99 at gmail.com>
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
>
> Hi,
>
> I've just finished setting up a single-node "cluster" with Slurm on Ubuntu
> 20.04. Infrastructural limitations prevent me from running it 24/7, and
> it's only powered on during business hours.
>
>
> Currently, I have a cron job running that hibernates that sole node before
> closing time.
>
> The hibernation is done with standard systemd, and hibernates to the swap
> partition.
>
> I have not run any lengthy slurm jobs on it yet. Before I do, can I get
> some thoughts on a couple of things?
>
> If it hibernated when slurm still had jobs running/queued, would they
> resume properly when the machine powers back on?
>
> Note that my swap space is bigger than my RAM.
>
> Is it perhaps necessary to set up a pre-hibernate script for systemd that
> iterates scontrol to suspend all the jobs before hibernating and resume them
> post-resume?
>
> What about the wall times? I'm guessing that slurm will count the downtime
> as elapsed for each job. Is there a way to configure this, or is the only
> alternative a post-hibernate script that iteratively updates the wall times
> of the running jobs using scontrol again?
>
> Thanks for your attention.
> Regards
> AR
>


-- 
Analabha Roy
Assistant Professor
Department of Physics
<http://www.buruniv.ac.in/academics/department/physics>
The University of Burdwan <http://www.buruniv.ac.in/>
Golapbag Campus, Barddhaman 713104
West Bengal, India
Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in, hariseldon99 at gmail.com
Webpage: http://www.ph.utexas.edu/~daneel/