<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>@Tina,</p>
<p>I figure slurmd reads the config once and runs with it. You
would need to have it recheck regularly to see if there are any
changes. This is exactly what 'scontrol reconfig' does: it tells all
the slurm nodes to recheck the config.</p>
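<p>As a rough sketch of the "recheck regularly" idea (not something
slurm ships): a small Python loop on the controller could hash the
config file and run 'scontrol reconfigure' whenever the hash changes.
The path you pass in, the polling interval, and the function names
are all assumptions for illustration.</p>
<pre>
```python
import hashlib
import subprocess
import time


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    with open(path, "rb") as handle:
        return hashlib.sha256(handle.read()).hexdigest()


def config_changed(path, last_digest):
    """Return (changed, current_digest) for the file at `path`."""
    current = file_digest(path)
    return current != last_digest, current


def watch_and_reconfigure(path, interval_seconds=60):
    """Poll `path` forever; run 'scontrol reconfigure' when it changes."""
    last = file_digest(path)
    while True:
        time.sleep(interval_seconds)
        changed, last = config_changed(path, last)
        if changed:
            # Assumes scontrol is on PATH and this account may run it.
            subprocess.run(["scontrol", "reconfigure"], check=True)
```
</pre>
<p>You would call something like
watch_and_reconfigure("/etc/slurm/slurm.conf") from a service unit;
that path is only a guess at where your config lives.</p>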
<p><br>
</p>
<p>@Steven,</p>
<p>It seems to me you could just have a monitor daemon that keeps
things up-to-date.<br>
It could watch for the alert that AWS sends (2 minute warning,
IIRC) and take appropriate action: drain the node and
cancel/checkpoint the job.<br>
In addition, it could keep an eye on things in case a warning
wasn't received and a node 'vanishes'. I suspect Nagios even has
the hooks to make that work. You could also email the user to let
them know their job was ended because the spot instance was pulled.<br>
</p>
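<p>A minimal sketch of the warning-watcher half of that idea, in
Python, assuming it runs on the spot instance itself, that the
instance metadata service (IMDSv2) is reachable, and that the slurm
node name matches the short hostname. The function names and the
5 second polling interval are made up for illustration; cancelling
or checkpointing the job and emailing the user are left out.</p>
<pre>
```python
import json
import socket
import subprocess
import time
import urllib.error
import urllib.request

METADATA = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2):
    """Return the spot instance-action body, or None if nothing is scheduled."""
    token_req = urllib.request.Request(
        METADATA + "/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            METADATA + "/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def interruption_pending(body):
    """True if the metadata body announces a spot interruption."""
    if not body:
        return False
    try:
        action = json.loads(body).get("action")
    except (ValueError, AttributeError):
        return False
    return action in ("stop", "terminate", "hibernate")


def drain_command(node, reason="spot interruption"):
    """Build the scontrol call that drains `node`."""
    return ["scontrol", "update", "NodeName=" + node,
            "State=DRAIN", "Reason=" + reason]


def monitor(node=None, interval_seconds=5):
    """Poll the metadata service; drain this node once a notice appears."""
    node = node or socket.gethostname().split(".")[0]
    while True:
        if interruption_pending(fetch_instance_action()):
            subprocess.run(drain_command(node), check=True)
            return
        time.sleep(interval_seconds)
```
</pre>
<p>Draining only stops new work from landing on the node; whatever
you do about the running job would have to fit inside the warning
window.</p>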
<p>Just some ideas,</p>
<p>Brian Andrus<br>
</p>
<div class="moz-cite-prefix">On 5/5/2022 6:28 AM, Steven Varga
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFkVQ9NGJL3c85LAQT6AWAwaC8NQVRs3ypsoGvDJ3NWV2rbCrw@mail.gmail.com">
<div dir="auto">
<div>Hi Tina,
<div dir="auto">Thank you for sharing. This matches my
observations when I checked if slurm could do what I am
up to: manage AWS EC2 dynamic (spot) instances. </div>
<div dir="auto"><br>
</div>
<div dir="auto">After replacing MySQL with REDIS, I now wonder
what it would take to make slurm node addition | removal
dynamic. I've been looking at the source code for many
months now, trying to decide if it can be done. </div>
<div dir="auto"><br>
</div>
<div dir="auto">I am using configless, 3 controllers, 2
slurmdbs with a redis sentinel based robust backend. </div>
<div dir="auto"><br>
</div>
<div dir="auto">Steven</div>
<br>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu., May 5, 2022,
08:57 Tina Friedrich, &lt;<a
href="mailto:tina.friedrich@it.ox.ac.uk"
moz-do-not-send="true" class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Hi List,<br>
<br>
out of curiosity - I would assume that if running
configless, one <br>
doesn't manually need to restart slurmd on the nodes if
the config changes?<br>
<br>
Hi Steven,<br>
<br>
I have no idea if you want to do it every couple of
minutes and what the <br>
implications are of that (although I've certainly managed
to restart them <br>
every 5 minutes by accident with no real problems caused),
but - <br>
generally, restarting the daemons (slurmctld, slurmd) is a
non-issue, as <br>
it's a safe operation. There's no risk to running jobs or
anything. I <br>
have the config management restart them if any files
change. It also <br>
doesn't seem to matter if the restarts of the controller
&amp; the node <br>
daemons are splayed a bit (i.e. don't happen at the same
time), or what <br>
order they happen in.<br>
<br>
Tina<br>
<br>
On 05/05/2022 13:17, Steven Varga wrote:<br>
> Thank you for the quick reply! I know I am pushing my
luck here: is it <br>
> possible to modify slurm: src/common/[read_conf.c,
node_conf.c] <br>
> src/slurmctld/[read_config.c, ...] such that the
state can be maintained <br>
> dynamically? -- or cheaper to write a job manager
with fewer features but <br>
> supporting dynamic nodes from the ground up?<br>
> best wishes: steve<br>
> <br>
> On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
&lt;<a href="mailto:chris@csamuel.org" target="_blank"
rel="noreferrer" moz-do-not-send="true"
class="moz-txt-link-freetext">chris@csamuel.org</a>&gt;
wrote:<br>
> <br>
> On 5/4/22 7:26 pm, Steven Varga wrote:<br>
> <br>
> > I am wondering what is the best way to
update node changes, such as<br>
> > addition and removal of nodes to SLURM. The
excerpts below suggest a<br>
> > full restart, can someone confirm this?<br>
> <br>
> You are correct, you need to restart slurmctld
and slurmd daemons at<br>
> present. See <a
href="https://slurm.schedmd.com/faq.html#add_nodes"
rel="noreferrer noreferrer" target="_blank"
moz-do-not-send="true" class="moz-txt-link-freetext">https://slurm.schedmd.com/faq.html#add_nodes</a><br>
> <br>
> All the best,<br>
> Chris<br>
> -- <br>
> Chris Samuel : <a
href="http://www.csamuel.org/" rel="noreferrer
noreferrer" target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">http://www.csamuel.org/</a>
<br>
> : Berkeley, CA, USA<br>
> <br>
<br>
-- <br>
Tina Friedrich, Advanced Research Computing Snr HPC
Systems Administrator<br>
<br>
Research Computing and Support Services<br>
IT Services, University of Oxford<br>
<a href="http://www.arc.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">http://www.arc.ox.ac.uk</a>
<a href="http://www.it.ox.ac.uk" rel="noreferrer
noreferrer" target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">http://www.it.ox.ac.uk</a><br>
<br>
</blockquote>
</div>
</div>
</div>
</blockquote>
</body>
</html>