[slurm-users] [External] Hibernating a whole cluster
Diego Zuccato
diego.zuccato at unibo.it
Tue Feb 7 13:55:46 UTC 2023
That's probably not optimal, but it could work. I'd go with brutal
preemption: swapping 90+ GB can be quite time-consuming.
Diego
On 07/02/2023 14:18, Analabha Roy wrote:
>
>
> On Tue, 7 Feb 2023, 18:12 Diego Zuccato <diego.zuccato at unibo.it> wrote:
>
> RAM used by a suspended job is not released. At most it can be swapped
> out (if enough swap is available).
>
>
>
> There should be enough swap available. I have 93 GB of RAM and an equally
> large swap partition. I can top it off with swap files if needed.
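>
> (For reference, adding a swap file would be roughly, untested:
>
>     sudo fallocate -l 32G /swapfile    # size is just an example
>     sudo chmod 600 /swapfile
>     sudo mkswap /swapfile
>     sudo swapon /swapfile
>
> though as far as I know the hibernation image itself still goes to
> whatever device is configured as the resume target, so extra swap files
> mainly help with ordinary swapping.)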
>
>
>
>
>
> On 07/02/2023 13:14, Analabha Roy wrote:
> > Hi Sean,
> >
> > Thanks for your awesome suggestion! I'm going through the reservation
> > docs now. At first glance, it seems like a daily reservation would turn
> > down jobs that are too big for the reservation. It'd be nice if slurm
> > could suspend (in the manner of 'scontrol suspend') jobs during reserved
> > downtime and resume them after. That way, folks can submit large jobs
> > without having to worry about the downtimes. Perhaps the FLEX option in
> > reservations can accomplish this somehow?
> >
> >
> > I suppose that I can do it using a shell script iterator and a cron
> > job, but that seems like an ugly hack. I was hoping there is a way to
> > configure this in slurm itself?
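> >
> > The hack would look roughly like this (untested; it just walks the
> > queue with squeue and calls scontrol on each job):
> >
> >     #!/bin/bash
> >     # suspend every running job before the nightly hibernation
> >     for jobid in $(squeue --noheader --states=RUNNING --format=%i); do
> >         scontrol suspend "$jobid"
> >     done
> >
> > with a matching loop over "squeue --states=SUSPENDED" calling
> > "scontrol resume" after power-on.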
> >
> > AR
> >
> > On Tue, 7 Feb 2023 at 16:06, Sean Mc Grath <smcgrat at tcd.ie> wrote:
> >
> > Hi Analabha,
> >
> >         Could you do something like create a daily reservation for 8 hours
> >         that starts at 9am, or whatever times work for you, like the
> >         following untested command:
> >
> >         scontrol create reservation starttime=09:00:00 duration=8:00:00
> >         nodecnt=1 flags=daily ReservationName=daily
> >
> >         Daily option at https://slurm.schedmd.com/scontrol.html#OPT_DAILY
> >
> >         Some more possibly helpful documentation at
> >         https://slurm.schedmd.com/reservations.html, search for "daily".
> >
> >         My idea being that jobs can only run in that reservation (that
> >         would have to be configured separately, not sure how off the top
> >         of my head), which is only active during the times you want the
> >         node to be working. So the cronjob that hibernates/shuts it down
> >         will do so when there are no jobs running. At least in theory.
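> >
> >         (Once the reservation exists, I think jobs would have to be
> >         submitted into it explicitly, along the lines of:
> >
> >         sbatch --reservation=daily job.sh
> >
> >         but again, untested.)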
> >
> > Hope that helps.
> >
> > Sean
> >
> > ---
> > Sean McGrath
> > Senior Systems Administrator, IT Services
> >
> >
> ------------------------------------------------------------------------
> >         *From:* slurm-users <slurm-users-bounces at lists.schedmd.com>
> >         on behalf of Analabha Roy <hariseldon99 at gmail.com>
> >         *Sent:* Tuesday 7 February 2023 10:05
> >         *To:* Slurm User Community List <slurm-users at lists.schedmd.com>
> >         *Subject:* Re: [slurm-users] [External] Hibernating a whole cluster
> > Hi,
> >
> >         Thanks. I had read the Slurm Power Saving Guide before. I believe
> >         the configs enable slurmctld to check other nodes for idleness and
> >         suspend/resume them. Slurmctld must run on a separate, always-on
> >         server for this to work, right?
> >
> >         My issue might be a little different. I literally have only one
> >         node that runs everything: slurmctld, slurmd, slurmdbd, everything.
> >
> >         This node must be set to "sudo systemctl hibernate" after business
> >         hours, regardless of whether jobs are queued or running. The next
> >         business day, it can be switched on manually.
> >
> >         systemctl hibernate is supposed to save the entire run state of
> >         the sole node to swap and power off. When powered on again, it
> >         should restore everything to its previous running state.
> >
> >         When the job queue is empty, this works well. I'm not sure how
> >         well this hibernate/resume will work with running jobs and would
> >         appreciate any suggestions or insights.
> >
> > AR
> >
> >
> >         On Tue, 7 Feb 2023 at 01:39, Florian Zillner <fzillner at lenovo.com> wrote:
> >
> > Hi,
> >
> >             follow this guide: https://slurm.schedmd.com/power_save.html
> >
> >             Create poweroff / poweron scripts and configure slurm to do
> >             the poweroff after X minutes. Works well for us. Make sure to
> >             set an appropriate time (ResumeTimeout) to allow the node to
> >             come back to service.
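> >
> >             For reference, the relevant slurm.conf entries look roughly
> >             like this (script paths and timings are just placeholders):
> >
> >             SuspendProgram=/etc/slurm/poweroff.sh
> >             ResumeProgram=/etc/slurm/poweron.sh
> >             SuspendTime=600
> >             ResumeTimeout=600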
> >             Note that we did not achieve good power saving with suspending
> >             the nodes; powering them off and on saves way more power. The
> >             downside is that it takes ~5 mins to resume (= power on) the
> >             nodes when needed.
> >
> > Cheers,
> > Florian
> >
> ------------------------------------------------------------------------
> >             *From:* slurm-users <slurm-users-bounces at lists.schedmd.com>
> >             on behalf of Analabha Roy <hariseldon99 at gmail.com>
> >             *Sent:* Monday, 6 February 2023 18:21
> >             *To:* slurm-users at lists.schedmd.com
> >             *Subject:* [External] [slurm-users] Hibernating a whole cluster
> > Hi,
> >
> >             I've just finished setting up a single-node "cluster" with
> >             Slurm on Ubuntu 20.04. Infrastructural limitations prevent me
> >             from running it 24/7, and it's only powered on during business
> >             hours.
> >
> >
> >             Currently, I have a cron job running that hibernates that
> >             sole node before closing time.
> >
> >             The hibernation is done with standard systemd, and hibernates
> >             to the swap partition.
> >
> >             I have not run any lengthy slurm jobs on it yet. Before I do,
> >             can I get some thoughts on a couple of things?
> >
> >             If it hibernated when slurm still had jobs running/queued,
> >             would they resume properly when the machine powers back on?
> >
> > Note that my swap space is bigger than my RAM.
> >
> >             Is it necessary to perhaps set up a pre-hibernate script for
> >             systemd that iterates with scontrol to suspend all the jobs
> >             before hibernating and resumes them after waking?
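> >
> >             For instance, a hook in /usr/lib/systemd/system-sleep/ could
> >             be something like this untested sketch (systemd calls it with
> >             "pre" or "post" as the first argument):
> >
> >             #!/bin/bash
> >             # suspend Slurm jobs before hibernation, resume them on wake
> >             case "$1" in
> >                 pre)  squeue --noheader --states=RUNNING --format=%i | xargs -r -n1 scontrol suspend ;;
> >                 post) squeue --noheader --states=SUSPENDED --format=%i | xargs -r -n1 scontrol resume ;;
> >             esac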
> >
> >             What about the wall times? I'm guessing that slurm will count
> >             the downtime as elapsed time for each job. Is there a way to
> >             configure this, or is the only alternative a post-hibernate
> >             script that iteratively updates the wall times of the running
> >             jobs using scontrol again?
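> >
> >             (The per-job update would presumably be something like
> >             "scontrol update jobid=<id> TimeLimit=<new limit>", run for
> >             each affected job after resume, but I haven't tried it.)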
> >
> > Thanks for your attention.
> > Regards
> > AR
> >
> >
> >
> > --
> > Analabha Roy
> > Assistant Professor
> > Department of Physics <http://www.buruniv.ac.in/academics/department/physics>
> > The University of Burdwan <http://www.buruniv.ac.in/>
> > Golapbag Campus, Barddhaman 713104
> > West Bengal, India
> > Emails: daneel at utexas.edu, aroy at phys.buruniv.ac.in,
> > hariseldon99 at gmail.com
> > Webpage: http://www.ph.utexas.edu/~daneel/
> >
--
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786