<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>@Tina,</p>
    <p>I figure slurmd reads the config in once and runs with it. You
      would need to have it recheck regularly to see if there are any
      changes. This is exactly what 'scontrol reconfig' does: it tells all
      the Slurm nodes to recheck the config.</p>
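    <p>As a rough sketch of the "recheck when something changes" idea,
      something like the following could run from cron or a config
      management hook. It assumes a plain file on disk to hash; the
      function names and the injectable <code>run</code> argument are
      made up for illustration, and only 'scontrol reconfigure' itself
      is a real Slurm command.</p>

```python
import hashlib
import subprocess
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of the file at `path`."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def reconfigure_if_changed(conf_path, last_digest, run=subprocess.run):
    """Run 'scontrol reconfigure' when the config differs from last_digest.

    Returns the current digest so the caller can store it for the next
    check. `run` is injectable (hypothetical convenience) so the logic
    can be exercised on a machine without Slurm installed.
    """
    current = file_digest(conf_path)
    if current != last_digest:
        # Ask the Slurm daemons to re-read their configuration.
        run(["scontrol", "reconfigure"], check=True)
    return current
```

    <p>The caller keeps the returned digest (in memory or in a small
      state file) and passes it back in on the next check, so the
      reconfigure only fires when the file content actually changes.</p>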
    <p><br>
    </p>
    <p>@Steven,</p>
    <p>It seems to me you could just have a monitor daemon that keeps
      things up to date.<br>
      It could watch for the alert that AWS sends (2 minute warning,
      IIRC) and take appropriate action: drain the node and
      cancel or checkpoint the job.<br>
      In addition, it could keep an eye on things in case a warning
      wasn't received and a node 'vanishes'.  I suspect Nagios even has
      the hooks to make that work. You could also email the user to let
      them know their job was ended because the spot instance was
      reclaimed.<br>
    </p>
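    <p>A minimal sketch of the warning-watching part, assuming the
      daemon runs on the spot instance itself and that the Slurm node
      name is known to it. It polls the EC2 instance metadata service
      (IMDSv2) for a pending spot action, then drains the node and
      cancels its jobs. The function names, the
      <code>spot_interruption</code> reason string, and the injectable
      <code>run</code> and <code>list_jobs</code> arguments are
      illustrative assumptions; checkpointing, the 'vanished node' case
      and the user email are left out.</p>

```python
import json
import subprocess
import urllib.error
import urllib.request

# EC2 instance metadata endpoints (IMDSv2); reachable only from inside an instance.
TOKEN_URL = "http://169.254.169.254/latest/api/token"
ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_spot_action(timeout=2):
    """Return the pending spot action as a dict, or None if none is scheduled."""
    token_req = urllib.request.Request(
        TOKEN_URL, method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    action_req = urllib.request.Request(
        ACTION_URL, headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # 404 means no interruption notice yet
            return None
        raise


def jobs_on_node(node):
    """List the job IDs currently on `node`, as reported by squeue."""
    out = subprocess.run(["squeue", "-h", "-w", node, "-o", "%A"],
                         check=True, capture_output=True, text=True).stdout
    return out.split()


def drain_commands(node, job_ids, reason="spot_interruption"):
    """Build the Slurm commands that drain `node` and cancel its jobs."""
    cmds = [["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", f"Reason={reason}"]]
    cmds += [["scancel", str(job_id)] for job_id in job_ids]
    return cmds


def handle_interruption(node, action, list_jobs=jobs_on_node,
                        run=subprocess.run):
    """Drain the node and cancel its jobs if a spot action is pending.

    Returns the list of commands that were run (empty if nothing to do).
    """
    if action is None:
        return []
    cmds = drain_commands(node, list_jobs(node))
    for cmd in cmds:
        run(cmd, check=True)
    return cmds
```

    <p>The daemon's main loop would then just call
      <code>handle_interruption(node, fetch_spot_action())</code> every
      few seconds; with a 2 minute warning, a 5 second poll leaves
      plenty of time to act.</p>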
    <p>Just some ideas,</p>
    <p>Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 5/5/2022 6:28 AM, Steven Varga
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAFkVQ9NGJL3c85LAQT6AWAwaC8NQVRs3ypsoGvDJ3NWV2rbCrw@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="auto">
        <div>Hi Tina,
          <div dir="auto">Thank you for sharing. This matches my
            observations when I checked whether Slurm could do what I am
            up to: managing AWS EC2 dynamic (spot) instances. </div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">After replacing MySQL with Redis, I now wonder
            what it would take to make Slurm node addition and removal
            dynamic. I've been looking at the source code for many
            months now, trying to decide whether it can be done. </div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">I am using configless mode with 3 controllers
            and 2 slurmdbd instances, on a robust Redis Sentinel based
            backend. </div>
          <div dir="auto"><br>
          </div>
          <div dir="auto">Steven</div>
          <br>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Thu., May 5, 2022,
              08:57 Tina Friedrich, <<a
                href="mailto:tina.friedrich@it.ox.ac.uk"
                moz-do-not-send="true" class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>>
              wrote:<br>
            </div>
            <blockquote class="gmail_quote" style="margin:0 0 0
              .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi List,<br>
              <br>
              out of curiosity - I would assume that if running
              configless, one <br>
              doesn't manually need to restart slurmd on the nodes if
              the config changes?<br>
              <br>
              Hi Steven,<br>
              <br>
              I have no idea if you want to do it every couple of
              minutes and what the <br>
              implications are of that (although I've certainly managed
              to restart them <br>
              every 5 minutes by accident with no real problems caused),
              but - <br>
              generally, restarting the daemons (slurmctld, slurmd) is a
              non-issue, as <br>
              it's a safe operation. There's no risk to running jobs or
              anything. I <br>
              have the config management restart them if any files
              change. It also <br>
              doesn't seem to matter if the restarts of the controller
              & the node <br>
              daemons are splayed a bit (i.e. don't happen at the same
              time), or what <br>
              order they happen in.<br>
              <br>
              Tina<br>
              <br>
              On 05/05/2022 13:17, Steven Varga wrote:<br>
              > Thank you for the quick reply! I know I am pushing my
              luck here: is it <br>
              > possible to modify slurm: src/common/[read_conf.c,
              node_conf.c]  <br>
              > src/slurmctld/[read_config.c, ...] such that the
              state can be maintained <br>
              > dynamically? -- or is it cheaper to write a job manager
              with fewer features but <br>
              > supporting dynamic nodes from the ground up?<br>
              > best wishes: steve<br>
              > <br>
              > On Thu, May 5, 2022 at 12:29 AM Christopher Samuel
              <<a href="mailto:chris@csamuel.org" target="_blank"
                rel="noreferrer" moz-do-not-send="true"
                class="moz-txt-link-freetext">chris@csamuel.org</a> <br>
              > <mailto:<a href="mailto:chris@csamuel.org"
                target="_blank" rel="noreferrer" moz-do-not-send="true"
                class="moz-txt-link-freetext">chris@csamuel.org</a>>>
              wrote:<br>
              > <br>
              >     On 5/4/22 7:26 pm, Steven Varga wrote:<br>
              > <br>
              >      > I am wondering what is the best way to
              update node changes, such as<br>
              >      > addition and removal of nodes to SLURM. The
              excerpts below suggest a<br>
              >      > full restart, can someone confirm this?<br>
              > <br>
              >     You are correct, you need to restart slurmctld
              and slurmd daemons at<br>
              >     present.  See <a
                href="https://slurm.schedmd.com/faq.html#add_nodes"
                rel="noreferrer noreferrer" target="_blank"
                moz-do-not-send="true" class="moz-txt-link-freetext">https://slurm.schedmd.com/faq.html#add_nodes</a><br>
              >     <<a
                href="https://slurm.schedmd.com/faq.html#add_nodes"
                rel="noreferrer noreferrer" target="_blank"
                moz-do-not-send="true" class="moz-txt-link-freetext">https://slurm.schedmd.com/faq.html#add_nodes</a>><br>
              > <br>
              >     All the best,<br>
              >     Chris<br>
              >     -- <br>
              >     Chris Samuel  : <a
                href="http://www.csamuel.org/" rel="noreferrer
                noreferrer" target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://www.csamuel.org/</a>
              <<a href="http://www.csamuel.org/" rel="noreferrer
                noreferrer" target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://www.csamuel.org/</a>>
              <br>
              >     :  Berkeley, CA, USA<br>
              > <br>
              <br>
              -- <br>
              Tina Friedrich, Advanced Research Computing Snr HPC
              Systems Administrator<br>
              <br>
              Research Computing and Support Services<br>
              IT Services, University of Oxford<br>
              <a href="http://www.arc.ox.ac.uk" rel="noreferrer
                noreferrer" target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://www.arc.ox.ac.uk</a>
              <a href="http://www.it.ox.ac.uk" rel="noreferrer
                noreferrer" target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://www.it.ox.ac.uk</a><br>
              <br>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>