[slurm-users] [External] Hibernating a whole cluster

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Mon Feb 6 20:24:39 UTC 2023


I would agree with Florian about using the Slurm power_save method.

In the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving 
there are additional details and scripts for performing node suspend and 
resume.

You would need the server to have a BMC so that you can power it down 
and up using IPMI commands from your Slurm management server.
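
As a rough illustration of the IPMI approach, here is a minimal Python sketch of a helper that a SuspendProgram/ResumeProgram wrapper could call. It is not one of the scripts from the Wiki page. The "-bmc" hostname suffix, the user name, and the password file path are assumptions made for the example; adapt them to your site.

```python
#!/usr/bin/env python3
"""Hypothetical helper: power Slurm nodes off or on through their BMCs."""
import subprocess
import sys

# Assumptions for illustration only (not from this thread):
BMC_SUFFIX = "-bmc"                           # BMC of node01 is node01-bmc
IPMI_USER = "admin"                           # placeholder user name
IPMI_PASSWORD_FILE = "/etc/slurm/ipmi_password"  # placeholder path


def expand_hostlist(hostlist):
    """Expand a Slurm hostlist such as 'node[01-03]' into single names."""
    result = subprocess.run(
        ["scontrol", "show", "hostnames", hostlist],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.split()


def ipmi_power_command(node, action):
    """Build the ipmitool argument list that powers one node 'on' or 'off'."""
    if action not in ("on", "off"):
        raise ValueError(f"unsupported action: {action}")
    return [
        "ipmitool", "-I", "lanplus",
        "-H", node + BMC_SUFFIX,
        "-U", IPMI_USER,
        "-f", IPMI_PASSWORD_FILE,
        # "off" is a hard power-off; "chassis power soft" asks the OS to shut down
        "chassis", "power", action,
    ]


def main(argv):
    # Expected call: node_power.py on|off <hostlist>
    action, hostlist = argv[1], argv[2]
    for node in expand_hostlist(hostlist):
        subprocess.run(ipmi_power_command(node, action), check=True)


if __name__ == "__main__":
    main(sys.argv)
```

Slurm passes the affected nodes to SuspendProgram and ResumeProgram as a single hostlist argument, so two small wrappers could call this helper with "off" and "on" respectively.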

/Ole


On 06-02-2023 21:07, Florian Zillner wrote:
> Follow this guide: https://slurm.schedmd.com/power_save.html
> 
> Create poweroff / poweron scripts and configure Slurm to power off idle 
> nodes after X minutes. Works well for us. Make sure to set an appropriate 
> ResumeTimeout to allow the node to come back into service.
> Note that we did not achieve good power savings by suspending the 
> nodes; powering them off and on saves far more power. The downside is 
> that it takes ~5 minutes to resume (= power on) the nodes when needed.
> 
> Cheers,
> Florian
> ------------------------------------------------------------------------
> *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of 
> Analabha Roy <hariseldon99 at gmail.com>
> *Sent:* Monday, 6 February 2023 18:21
> *To:* slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
> *Subject:* [External] [slurm-users] Hibernating a whole cluster
> Hi,
> 
> I've just finished setting up a single-node "cluster" with Slurm on 
> Ubuntu 20.04. Infrastructural limitations prevent me from running it 
> 24/7, and it's only powered on during business hours.
> 
> 
> Currently, I have a cron job running that hibernates that sole node 
> before closing time.
> 
> The hibernation is done with standard systemd, and hibernates to the 
> swap partition.
> 
> I have not run any lengthy Slurm jobs on it yet. Before I do, can I 
> get some thoughts on a couple of things?
> 
> If it hibernates while Slurm still has jobs running/queued, will they 
> resume properly when the machine powers back on?
> 
> Note that my swap space is bigger than my RAM.
> 
> Is it perhaps necessary to set up a pre-hibernate script for systemd that 
> iterates scontrol to suspend all the jobs before hibernating and resumes 
> them post-resume?
> 
> What about the wall times? I'm guessing that Slurm will count the 
> downtime as elapsed for each job. Is there a way to configure this, or is 
> the only alternative a post-hibernate script that iteratively updates 
> the wall times of the running jobs using scontrol again?
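
The pre-hibernate/post-resume idea in the question could look roughly like the Python sketch below. It is only an illustration: the script name and the way it gets hooked into systemd are assumptions, nothing in this thread confirms that jobs survive hibernation this way, and it does not address the wall-time question. It uses only `squeue` and `scontrol suspend` / `scontrol resume`, which normally require root or Slurm operator privileges.

```python
#!/usr/bin/env python3
"""Hypothetical hook: suspend all running jobs, or resume all suspended ones."""
import subprocess
import sys


def running_job_ids(squeue_output):
    """Parse `squeue --noheader --format=%A` output into unique job IDs."""
    ids = []
    for line in squeue_output.splitlines():
        job_id = line.strip()
        if job_id and job_id not in ids:
            ids.append(job_id)
    return ids


def scontrol_command(action, job_ids):
    """Build the scontrol argument list for 'suspend' or 'resume'."""
    if action not in ("suspend", "resume"):
        raise ValueError(f"unsupported action: {action}")
    # scontrol accepts a comma-separated list of job IDs
    return ["scontrol", action, ",".join(job_ids)]


def main(argv):
    # Expected call: job_hook.py suspend   (before hibernating)
    #                job_hook.py resume    (after waking up)
    action = argv[1]
    state = "RUNNING" if action == "suspend" else "SUSPENDED"
    listing = subprocess.run(
        ["squeue", "--noheader", f"--states={state}", "--format=%A"],
        check=True, capture_output=True, text=True,
    ).stdout
    job_ids = running_job_ids(listing)
    if job_ids:
        subprocess.run(scontrol_command(action, job_ids), check=True)


if __name__ == "__main__":
    main(sys.argv)
```

Building the command in a separate function keeps the parsing and argument handling checkable without a running Slurm installation.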



