<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>I agree that people are making updating slurm.conf out to be a
      bigger issue than it is. However, there are certain config changes
      that do require restarting the daemon rather than just running
      'scontrol reconfigure'. These options are documented in the
      slurm.conf documentation (just search for "restart").<br>

    </p>

    <p>I believe it's often only slurmctld that needs to be restarted,
      which is one daemon on one system, rather than slurmd on all the
      compute nodes, but a few changes require all Slurm daemons to be
      restarted. Adding nodes to a cluster is one of them: <br>

    </p>

    <p>

      <blockquote type="cite">Changes in node configuration (e.g. adding

        nodes, changing their

        processor count, etc.) require restarting both the slurmctld

        daemon

        and the slurmd daemons.

        All slurmd daemons must know each node in the system to forward

        messages in support of hierarchical communications</blockquote>

    </p>

    <p>But to avoid this, you can define the nodes you expect to add
      later ahead of time with the FUTURE state: <br>

    </p>

    <p>

      <blockquote type="cite">

        <dl compact="compact">

          <dt><b>FUTURE</b></dt>

          <dd>

            Indicates the node is defined for future use and need not

            exist when the Slurm daemons are started. These nodes can be

            made available

            for use simply by updating the node state using the scontrol

            command rather

            than restarting the slurmctld daemon. After these nodes are

            made available,

            change their State in the slurm.conf file. Until these nodes

            are made

            available, they will not be seen using any Slurm commands or

            nor will

            any attempt be made to contact them.

          </dd>

        </dl>

      </blockquote>

    </p>
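    <p>As a rough sketch of how that might look (the node names, CPU
      count, and memory size below are made up for illustration, and the
      State value used in the scontrol command is an assumption; which
      value scontrol accepts for this can differ between Slurm releases,
      so check the scontrol man page for your version):<br>
    </p>
    <pre>
# slurm.conf (hypothetical cluster): node[01-04] exist today,
# node[05-08] are placeholders for hardware that has not arrived yet.
NodeName=node[01-04] CPUs=32 RealMemory=128000 State=UNKNOWN
NodeName=node[05-08] CPUs=32 RealMemory=128000 State=FUTURE

# Later, once node05 is installed and its slurmd is running, make it
# available without restarting slurmctld (assumption: State=IDLE;
# some releases may expect RESUME instead):
#   scontrol update NodeName=node05 State=IDLE
#
# Afterwards, change State=FUTURE to State=UNKNOWN for node05 in
# slurm.conf so the file matches the running configuration.
</pre>
    <p>The point of doing it this way is that every daemon already knows
      about the placeholder nodes from the day the config was last
      loaded, so bringing one online is only a state change.<br>
    </p>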

    <p>--<br>

      Prentice<br>

    </p>


    <div class="moz-cite-prefix">On 5/4/21 8:32 AM, Sid Young wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAEZ+gOxsa=6gU2MnG8C9Mj8Metq-7drDoa4xSF3cmZ714vfD3A@mail.gmail.com">


      <div dir="auto">You can push a new conf file and issue an

        "scontrol reconfigure" on the fly as needed... I do it on our

        cluster as needed, do the nodes first then login nodes then the

        slurm controller... you are making a huge issue of a very basic

        task...

        <div dir="auto"><br>

        </div>

        <div dir="auto">Sid</div>

        <div dir="auto"><br>

        </div>

      </div>

    </blockquote>

    <blockquote type="cite"

cite="mid:CAEZ+gOxsa=6gU2MnG8C9Mj8Metq-7drDoa4xSF3cmZ714vfD3A@mail.gmail.com"><br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Tue, 4 May 2021, 22:28 Tina

          Friedrich, &lt;<a href="mailto:tina.friedrich@it.ox.ac.uk"
            target="_blank" rel="noreferrer" moz-do-not-send="true">tina.friedrich@it.ox.ac.uk</a>&gt;

          wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0 0 0

          .8ex;border-left:1px #ccc solid;padding-left:1ex">Hello,<br>

          <br>

          a lot of people have already given very good answers on how to
          tackle this.<br>

          <br>

          Still, I thought it worth pointing this out - you said 'you

          need to <br>

          basically shut down slurm, update the slurm.conf file, then

          restart'. <br>

          That makes it sound like a major operation with lots of prep

          required.<br>

          <br>

          It's not like that at all. Updating slurm.conf is not a major

          operation.<br>

          <br>

          There's absolutely no reason to shut things down first &amp;

          then change the <br>

          file. You can edit the file / ship out a new version (however

          you like) <br>

          and then restart the daemons.<br>

          <br>

          The daemons do not have to all be restarted simultaneously. It

          is of no <br>

          consequence if they're running with out-of-sync config files

          for a bit, <br>

          really. (There's a flag you can set if you want to suppress

          the warning <br>

          - 'NO_CONF_HASH' debug flag I think).<br>

          <br>

          Restarting the daemons (slurmctld, slurmd, ...) is safe. It

          does not <br>

          require cluster downtime or anything.<br>

          <br>

          I control slurm.conf using configuration management; the

          config <br>

          management process restarts the appropriate daemon (slurmctld,

          slurmd, <br>

          slurmdbd) if the file changed. This certainly never happens at

          the same <br>

          time; there's splay in that. It doesn't even necessarily

          happen on the <br>

          controller first, or anything like that.<br>

          <br>

          What I'm trying to get across - I have a feeling this

          'updating the <br>

          cluster wide config file' and 'file must be the same on all

          nodes' is a <br>

          lot less of a procedure (and a lot less strict) than you

          currently <br>

          imagine it to be :)<br>

          <br>

          Tina<br>

          <br>

          On 27/04/2021 19:35, David Henkemeyer wrote:<br>

          > Hello,<br>

          > <br>

          > I'm new to Slurm (coming from PBS), and so I will likely

          have a few <br>

          > questions over the next several weeks, as I work to

          transition my <br>

          > infrastructure from PBS to Slurm.<br>

          > <br>

          > My first question has to do with *_adding nodes to

          Slurm_*.  According <br>

          > to the FAQ (and other articles I've read), you need to

          basically shut <br>

          > down slurm, update the slurm.conf file /*on all nodes in

          the cluster*/, <br>

          > then restart slurm.<br>

          > <br>

          > - Why do all nodes need to know about all other nodes? 

          From what I have <br>

          > read, Slurm does a checksum comparison of the

          slurm.conf file across <br>

          > all nodes.  Is this the only reason all nodes need to

          know about all <br>

          > other nodes?<br>

          > - Can I create a symlink that

          points &lt;sysconfdir&gt;/slurm.conf to a <br>

          > slurm.conf file on an NFS mount point, which is mounted

          on all the <br>

          > nodes?  This way, I would only need to update a single

          file, then <br>

          > restart Slurm across the entire cluster.<br>

          > - Any additional help/resources for adding/removing nodes

          to Slurm would <br>

          > be much appreciated.  Perhaps there is a "toolkit" out

          there to automate <br>

          > some of these operations (which is what I already have

          for PBS, and will <br>

          > create for Slurm, if something doesn't already exist).<br>

          > <br>

          > Thank you all,<br>

          > <br>

          > David<br>

          <br>

          -- <br>

          Tina Friedrich, Advanced Research Computing Snr HPC Systems

          Administrator<br>

          <br>

          Research Computing and Support Services<br>

          IT Services, University of Oxford<br>

          <a href="http://www.arc.ox.ac.uk" rel="noreferrer noreferrer

            noreferrer" target="_blank" moz-do-not-send="true">http://www.arc.ox.ac.uk</a>

          <a href="http://www.it.ox.ac.uk" rel="noreferrer noreferrer

            noreferrer" target="_blank" moz-do-not-send="true">http://www.it.ox.ac.uk</a><br>

          <br>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>