[slurm-users] [External] Hibernating a whole cluster
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Mon Feb 6 20:24:39 UTC 2023
I would agree with Florian about using the Slurm power_save method.
In the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
there are additional details and scripts for performing node suspend and
resume.
You would need the server to have a BMC so that you can power it down
and up using IPMI commands from your Slurm management server.
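
As an illustration only (this is not one of the scripts from the Wiki
page), such a power script could be sketched in Python roughly as follows.
It assumes that ipmitool is installed on the management server and that
each node's BMC is reachable as <node>-bmc; the user name, the password
file, and the script names node_suspend / node_resume are hypothetical
placeholders. Slurm passes a hostlist expression as the only argument to
SuspendProgram and ResumeProgram, so the sketch picks the action from the
name it was invoked under.

```python
#!/usr/bin/env python3
"""Sketch of a combined SuspendProgram/ResumeProgram helper (assumptions marked)."""
import os
import subprocess
import sys

BMC_SUFFIX = "-bmc"                              # assumption: BMC naming scheme
IPMI_USER = "admin"                              # assumption: placeholder user
IPMI_PASSWORD_FILE = "/etc/slurm/ipmi_password"  # assumption: placeholder path


def action_for_program(program_path):
    """Map the invoked script name to an ipmitool chassis power action."""
    name = os.path.basename(program_path)
    if "resume" in name:
        return "on"
    if "suspend" in name:
        return "soft"  # ACPI soft shutdown; "off" would cut power immediately
    raise ValueError(f"cannot derive power action from script name {name!r}")


def ipmi_power_command(node, action):
    """Build the ipmitool command line for one node without running it."""
    return [
        "ipmitool", "-I", "lanplus",
        "-H", node + BMC_SUFFIX,
        "-U", IPMI_USER,
        "-f", IPMI_PASSWORD_FILE,
        "chassis", "power", action,
    ]


def expand_hostlist(expression):
    """Expand e.g. 'n[001-003]' into single host names using scontrol."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", expression],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def main(argv):
    if len(argv) != 2:
        print(f"usage: {argv[0]} <hostlist>", file=sys.stderr)
        return 2
    action = action_for_program(argv[0])
    for node in expand_hostlist(argv[1]):
        # check=False: one unreachable BMC should not stop the remaining nodes
        subprocess.run(ipmi_power_command(node, action), check=False)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Installed once and symlinked as node_suspend and node_resume, a script
like this could be referenced from SuspendProgram and ResumeProgram in
slurm.conf. ResumeTimeout then has to cover the full boot time of a node.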
/Ole
On 06-02-2023 21:07, Florian Zillner wrote:
> follow this guide: https://slurm.schedmd.com/power_save.html
>
> Create poweroff / poweron scripts and configure slurm to do the poweroff
> after X minutes. Works well for us. Make sure to set an appropriate time
> (ResumeTimeout) to allow the node to come back to service.
> Note that we did not achieve good power savings by suspending the
> nodes; powering them off and on saves far more power. The downside is
> that it takes ~5 minutes to resume (= power on) a node when it is needed.
>
> Cheers,
> Florian
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of
> Analabha Roy <hariseldon99 at gmail.com>
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
> Hi,
>
> I've just finished setting up a single-node "cluster" with Slurm on
> Ubuntu 20.04. Infrastructural limitations prevent me from running it
> 24/7, and it's only powered on during business hours.
>
>
> Currently, I have a cron job running that hibernates that sole node
> before closing time.
>
> The hibernation is done with standard systemd, and hibernates to the
> swap partition.
>
> I have not run any lengthy slurm jobs on it yet. Before I do, can I
> get some thoughts on a couple of things?
>
> If it hibernated when slurm still had jobs running/queued, would they
> resume properly when the machine powers back on?
>
> Note that my swap space is bigger than my RAM.
>
> Is it perhaps necessary to set up a pre-hibernate script for systemd
> that iterates scontrol to suspend all the jobs before hibernating and
> resumes them post-resume?
>
> What about the wall times? I'm guessing that slurm will count the
> downtime as elapsed for each job. Is there a way to configure this, or is
> the only alternative a post-hibernate script that iteratively updates
> the wall times of the running jobs using scontrol again?
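
For reference, the pre-/post-hibernate hook asked about above could be
sketched as below. This is only an outline under stated assumptions: the
script would be placed in /usr/lib/systemd/system-sleep/, where systemd
calls it with "pre" or "post" as the first argument and the sleep type as
the second, and it would run with permission to use scontrol suspend and
scontrol resume. Whether this is enough for jobs to survive hibernation,
and how Slurm then accounts the wall time, is not settled in this thread.

```python
#!/usr/bin/env python3
"""Sketch of a systemd system-sleep hook that suspends/resumes Slurm jobs."""
import subprocess
import sys


def parse_job_ids(squeue_output):
    """Turn squeue output (one job ID per line) into a list of IDs."""
    return [line.strip() for line in squeue_output.splitlines() if line.strip()]


def list_jobs(state):
    """Return the IDs of all jobs currently in the given state."""
    result = subprocess.run(
        ["squeue", "--noheader", "--states", state, "--format", "%i"],
        check=True, capture_output=True, text=True,
    )
    return parse_job_ids(result.stdout)


def scontrol_command(phase, job_id):
    """Build the scontrol call for one job: suspend before, resume after."""
    verb = {"pre": "suspend", "post": "resume"}[phase]
    return ["scontrol", verb, job_id]


def main(argv):
    # systemd passes: <pre|post> <suspend|hibernate|hybrid-sleep|...>
    if len(argv) < 3 or argv[1] not in ("pre", "post") or argv[2] != "hibernate":
        return 0  # not a hibernate event: nothing to do
    phase = argv[1]
    # Caveat: "post" resumes every SUSPENDED job, including jobs that were
    # suspended by hand for unrelated reasons.
    state = "RUNNING" if phase == "pre" else "SUSPENDED"
    for job_id in list_jobs(state):
        # check=False: a job that just finished should not abort the hook
        subprocess.run(scontrol_command(phase, job_id), check=False)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

The command construction is kept separate from the subprocess calls so
that the logic can be checked on a machine without Slurm installed.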