[slurm-users] [External] Hibernating a whole cluster

Tue Feb 7 12:42:27 UTC 2023

RAM used by a suspended job is not released. At most it can be swapped 
out (if enough swap is available).
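
A rough way to check this before hibernating is sketched below. It
assumes that "memory in use" can be estimated as MemTotal minus
MemAvailable from /proc/meminfo; the function name swap_can_hold_ram is
made up for this example and is not part of Slurm or systemd.

```shell
#!/bin/sh
# Rough pre-hibernate sanity check (a sketch, not part of Slurm).
# Succeeds if free swap is at least as large as the memory currently in
# use, estimated as MemTotal - MemAvailable (an assumption of this
# sketch). /proc/meminfo reports all three values in kB.
swap_can_hold_ram() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "SwapFree:"     { swap  = $2 }
        END { exit !(total > 0 && swap >= total - avail) }
    ' "$meminfo"
}
```

It could be used as "swap_can_hold_ram && systemctl hibernate", so the
node only hibernates when the check passes.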

On 07/02/2023 13:14, Analabha Roy wrote:
> Hi Sean,
> 
> Thanks for your awesome suggestion! I'm going through the reservation 
> docs now. At first glance, it seems like a daily reservation would turn 
> down jobs that are too big for the reservation. It'd be nice if
> slurm could suspend (in the manner of 'scontrol suspend') jobs during 
> reserved downtime and resume them after. That way, folks can submit 
> large jobs without having to worry about the downtimes. Perhaps the FLEX 
> option in reservations can accomplish this somehow?
> 
> 
> I suppose that I can do it using a shell script iterator and a cron job, 
> but that seems like an ugly hack. Is there a way to configure this in 
> Slurm itself?
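
For reference, the "shell script iterator" mentioned above might look
roughly like the sketch below. The function names are invented for this
example. It assumes the caller has Slurm operator/admin rights, and note
that the resume step would also resume jobs that were suspended for
unrelated reasons.

```shell
#!/bin/sh
# Sketch: suspend every running job, and later resume every suspended
# one. Intended to be called from cron around the downtime window.

# Apply "scontrol <action>" to each job currently in the given state.
for_each_job_in_state() {
    state="$1"; action="$2"
    # -h: no header, -t: filter by state, -o %A: print only the job ID
    squeue -h -t "$state" -o %A | while read -r jobid; do
        scontrol "$action" "$jobid"
    done
}

suspend_running_jobs()  { for_each_job_in_state RUNNING   suspend; }
resume_suspended_jobs() { for_each_job_in_state SUSPENDED resume;  }
```

The cron job that runs before closing time would call
suspend_running_jobs, and a second one after power-on would call
resume_suspended_jobs.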
> 
> AR
> 
> On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcgrat at tcd.ie> wrote:
> 
>     Hi Analabha,
> 
>     Could you create a daily reservation for 8 hours that starts at
>     9am (or whatever times work for you), along the lines of the
>     following untested command:
> 
>     scontrol create reservation starttime=09:00:00 duration=8:00:00 \
>         nodecnt=1 flags=daily ReservationName=daily
> 
>     Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
> 
>     Some more possibly helpful documentation is at
>     https://slurm.schedmd.com/reservations.html, search for "daily".
> 
>     My idea is that jobs can only run in that reservation (that would
>     have to be configured separately, not sure how off the top of my
>     head), which is only active during the times you want the node to
>     be working. So the cronjob that hibernates/shuts it down will do so
>     when there are no jobs running. At least in theory.
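
The "only hibernate when nothing is running" part of that cron job
could be guarded along these lines. This is a sketch only: the function
names are invented here, and it does not check for pending jobs or for
the reservation itself.

```shell
#!/bin/sh
# Sketch: hibernate only when Slurm reports no running jobs.

# Succeeds (exit status 0) when squeue lists no job in the RUNNING state.
no_running_jobs() {
    [ -z "$(squeue -h -t RUNNING -o %A)" ]
}

# Hibernate if idle; otherwise report and leave the node up.
hibernate_if_idle() {
    if no_running_jobs; then
        systemctl hibernate
    else
        echo "jobs still running, not hibernating" >&2
        return 1
    fi
}
```

The cron entry would then call hibernate_if_idle instead of calling
"systemctl hibernate" directly.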
> 
>     Hope that helps.
> 
>     Sean
> 
>     ---
>     Sean McGrath
>     Senior Systems Administrator, IT Services
> 
>     ------------------------------------------------------------------------
>     *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on
>     behalf of Analabha Roy <hariseldon99 at gmail.com>
>     *Sent:* Tuesday 7 February 2023 10:05
>     *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
>     *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
>     Hi,
> 
>     Thanks. I had read the Slurm Power Saving Guide before. I believe
>     the configs enable slurmctld to check other nodes for idleness and
>     suspend/resume them. Slurmctld must run on a separate, always-on
>     server for this to work, right?
> 
>     My issue might be a little different. I literally have only one node
>     that runs everything: slurmctld, slurmd, slurmdbd, everything.
> 
>     This node must be hibernated ("sudo systemctl hibernate") after
>     business hours, regardless of whether jobs are queued or running.
>     The next business day, it can be switched on manually.
> 
>     systemctl hibernate is supposed to save the entire run state of the
>     sole node to swap and power off. When powered on again, it should
>     restore everything to its previous running state.
> 
>     When the job queue is empty, this works well. I'm not sure how well
>     this hibernate/resume will work with running jobs and would
>     appreciate any suggestions or insights.
> 
>     AR
> 
> 
>     On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzillner at lenovo.com>
>     wrote:
> 
>         Hi,
> 
>         follow this guide: https://slurm.schedmd.com/power_save.html
> 
>         Create poweroff / poweron scripts and configure Slurm to do the
>         poweroff after X minutes. Works well for us. Make sure to set an
>         appropriate time (ResumeTimeout) to allow the node to come back
>         to service.
>         Note that we did not achieve good power savings by suspending
>         the nodes; powering them off and on saves far more power. The
>         downside is that it takes ~5 mins to resume (= power on) the
>         nodes when needed.
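
As an illustration of that setup, a slurm.conf fragment might look like
the sketch below. The parameter names are the ones described in the
Power Saving Guide. The script paths and all values are placeholders
chosen for this example, not anyone's actual configuration.

```conf
# slurm.conf fragment (illustrative values; script paths are hypothetical)
SuspendProgram=/usr/local/sbin/node_poweroff.sh
ResumeProgram=/usr/local/sbin/node_poweron.sh
SuspendTime=600        # seconds a node must be idle before power-off
SuspendTimeout=120     # seconds allowed for the power-off to complete
ResumeTimeout=600      # seconds allowed for a node to come back
```

With a power-on time of about 5 minutes, a ResumeTimeout of 600 seconds
leaves some margin before the node is marked down.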
> 
>         Cheers,
>         Florian
>         ------------------------------------------------------------------------
>         *From:* slurm-users <slurm-users-bounces at lists.schedmd.com> on
>         behalf of Analabha Roy <hariseldon99 at gmail.com>
>         *Sent:* Monday, 6 February 2023 18:21
>         *To:* slurm-users at lists.schedmd.com
>         *Subject:* [External] [slurm-users] Hibernating a whole cluster
>         Hi,
> 
>         I've just finished setting up a single-node "cluster" with Slurm
>         on Ubuntu 20.04. Infrastructural limitations prevent me from
>         running it 24/7, and it's only powered on during business hours.
> 
> 
>         Currently, I have a cron job running that hibernates that sole
>         node before closing time.
> 
>         The hibernation is done with standard systemd, and hibernates to
>         the swap partition.
> 
>         I have not run any lengthy Slurm jobs on it yet. Before I do,
>         can I get some thoughts on a couple of things?
> 
>         If it hibernated when slurm still had jobs running/queued, would
>         they resume properly when the machine powers back on?
> 
>         Note that my swap space is bigger than my RAM.
> 
>         Is it perhaps necessary to set up a pre-hibernate script for
>         systemd that iterates scontrol to suspend all the jobs before
>         hibernating and resumes them post-resume?
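
Such a script could be sketched as a systemd system-sleep hook.
systemd-sleep runs executables placed in /usr/lib/systemd/system-sleep/
with two arguments, the phase and the sleep type. The function name
sleep_hook is invented for this example, and whether suspending the
jobs first is actually needed is left open here.

```shell
#!/bin/sh
# Sketch of a systemd system-sleep hook. systemd-sleep calls each hook
# with two arguments:
#   $1 = "pre" or "post", $2 = "suspend", "hibernate", "hybrid-sleep", ...
sleep_hook() {
    case "$1/$2" in
        pre/hibernate)
            # Before the image is written: suspend all running jobs.
            for jobid in $(squeue -h -t RUNNING -o %A); do
                scontrol suspend "$jobid"
            done
            ;;
        post/hibernate)
            # After the machine is back: resume the suspended jobs.
            for jobid in $(squeue -h -t SUSPENDED -o %A); do
                scontrol resume "$jobid"
            done
            ;;
    esac
}

# Run the hook only when systemd passes both arguments.
if [ "$#" -eq 2 ]; then
    sleep_hook "$1" "$2"
fi
```

Other sleep types (plain suspend, hybrid-sleep) fall through the case
statement and are left untouched.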
> 
>         What about the wall times? I'm guessing that Slurm will count the
>         downtime as elapsed for each job. Is there a way to configure
>         this, or is the only alternative a post-hibernate script that
>         iteratively updates the wall times of the running jobs using
>         scontrol again?
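
If the downtime does turn out to be counted, the post-hibernate update
could be sketched as below. It relies on scontrol's "TimeLimit+=N"
form, which increments a job's current limit by N minutes and needs
admin rights. The function names and the idea of recording epoch
timestamps before and after hibernation are assumptions of this
example.

```shell
#!/bin/sh
# Sketch: after resume, extend each running job's time limit by the
# length of the downtime.

# Whole minutes between two epoch timestamps, rounded up.
downtime_minutes() {
    echo $(( ($2 - $1 + 59) / 60 ))
}

# Add the given number of minutes to the time limit of every running job.
# JobId must come before TimeLimit for the increment form to work.
extend_time_limits() {
    minutes="$1"
    for jobid in $(squeue -h -t RUNNING -o %A); do
        scontrol update "JobId=$jobid" "TimeLimit+=$minutes"
    done
}
```

A pre-hibernate step would store "date +%s" somewhere, and the
post-hibernate step would pass that value and the current time to
downtime_minutes, then hand the result to extend_time_limits.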
> 
>         Thanks for your attention.
>         Regards
>         AR
> 
> 
> 
>     -- 
>     Analabha Roy
>     Assistant Professor
>     Department of Physics
>     <http://www.buruniv.ac.in/academics/department/physics>
>     The University of Burdwan <http://www.buruniv.ac.in/>
>     Golapbag Campus, Barddhaman 713104
>     West Bengal, India
>     Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in,
>     hariseldon99 at gmail.com
>     Webpage: http://www.ph.utexas.edu/~daneel/
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786