<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>That is interesting to me.</p>
    <p>How do you use ulimit and systemd to limit user usage on the
      login nodes? This sounds like something very useful.</p>
    <p>Brian Andrus<br>
    </p>
    <div class="moz-cite-prefix">On 10/31/2021 1:08 AM, Yair Yarom
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAAHNG4bVCr0LmFYTZdhEmuhGhvNv64pNtdKAtvu7+zzWXg9weQ@mail.gmail.com">
      <meta http-equiv="content-type" content="text/html; charset=UTF-8">
      <div dir="ltr">
        <div dir="ltr">
          <div>Hi,</div>
          <div><br>
          </div>
          <div>If it helps, this is our setup:</div>
          <div>6 clusters (actually a bit more)<br>
          </div>
          <div>1 mysql + slurmdbd on the same host </div>
          <div>6 primary slurmctld on 3 hosts (need to make sure each
            has a distinct SlurmctldPort)</div>
          <div>6 secondary slurmctld on an arbitrary node on the
            clusters themselves.<br>
          </div>
          <div>1 login node per cluster (this is a very small VM, and
            the users are limited both in cpu time (with ulimit) and in
            memory (with systemd); see the limits sketch below)</div>
          <div>The slurm.conf's are shared over NFS to everyone as
            /path/to/nfs/&lt;cluster name&gt;/slurm.conf, with a symlink
            from /etc on each node to the file of its cluster (see the
            symlink sketch below).<br>
          </div>
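          <div><br>
          </div>
          <div>(Limits sketch: one way such limits can be configured,
            assuming pam_limits for the CPU-time ulimit and a systemd
            user-slice drop-in for the memory cap; paths and values here
            are illustrative only, not the exact config used here:)</div>
          <pre># /etc/security/limits.d/90-login.conf -- CPU-time ulimit in minutes,
# applied by pam_limits at login
*    soft    cpu    30
*    hard    cpu    60

# /etc/systemd/system/user-.slice.d/90-memory.conf -- per-user memory cap
# via systemd user slices (cgroup v2)
[Slice]
MemoryMax=4G</pre>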
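          <div>(Symlink sketch: how the shared config can be wired up on
            each node; the exact /etc path depends on how slurm was
            built, so treat it as a placeholder:)</div>
          <pre># on each node, link the config of its cluster into place
ln -s /path/to/nfs/&lt;cluster name&gt;/slurm.conf /etc/slurm/slurm.conf</pre>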
          <div><br>
          </div>
          <div>The -M option generally works: we can submit/query jobs
            from a login node of one cluster to another. But there's a
            caveat to notice when upgrading: slurmdbd must be upgraded
            first, and we usually have a not-so-small gap between
            upgrading the different clusters. This causes -M to stop
            working, because binaries of one version won't work with the
            other (I don't remember in which direction).</div>
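          <div><br>
          </div>
          <div>(When it works, typical cross-cluster commands from a
            login node look like this; the cluster name, script name and
            job id are hypothetical:)</div>
          <pre>squeue -M clusterB              # queue of another cluster
sbatch -M clusterB job.sh       # submit to another cluster
sacct  -M clusterB -j &lt;jobid&gt;   # accounting data from another cluster
squeue -M all                   # all clusters known to the shared slurmdbd</pre>
          <div><br>
          </div>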
          <div>We solved the version mismatch by using an lmod module per
            cluster, which both sets the SLURM_CONF environment variable
            and points PATH at the correct slurm binaries (which we
            install in /usr/local/slurm/&lt;version&gt;/ so that they
            co-exist). So when -M won't work, users can use:</div>
          <div>module load slurm/clusterA</div>
          <div>squeue</div>
          <div>module load slurm/clusterB</div>
          <div>squeue</div>
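          <div><br>
          </div>
          <div>(A minimal sketch of such a modulefile, e.g.
            slurm/clusterA.lua; the paths are placeholders matching the
            layout above:)</div>
          <pre>-- select this cluster's config and put the matching binaries first in PATH
setenv("SLURM_CONF", "/path/to/nfs/clusterA/slurm.conf")
prepend_path("PATH", "/usr/local/slurm/&lt;version&gt;/bin")
prepend_path("MANPATH", "/usr/local/slurm/&lt;version&gt;/share/man")</pre>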
          <div><br>
          </div>
          <div>BR,<br>
          </div>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021 at 7:39
            PM navin srivastava <<a
              href="mailto:navin.altair@gmail.com"
              moz-do-not-send="true" class="moz-txt-link-freetext">navin.altair@gmail.com</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px
            0.8ex;border-left:1px solid
            rgb(204,204,204);padding-left:1ex">
            <div dir="auto">Thank you Tina. 
              <div dir="auto">It will really help</div>
              <div dir="auto"><br>
              </div>
              <div dir="auto">Regards </div>
              <div dir="auto">Navin </div>
            </div>
            <br>
            <div class="gmail_quote">
              <div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021,
                22:01 Tina Friedrich <<a
                  href="mailto:tina.friedrich@it.ox.ac.uk"
                  target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>>
                wrote:<br>
              </div>
              <blockquote class="gmail_quote" style="margin:0px 0px 0px
                0.8ex;border-left:1px solid
                rgb(204,204,204);padding-left:1ex">Hello,<br>
                <br>
                I have the database on a separate server (it runs the
                database and the <br>
                database only). The login nodes run nothing SLURM
                related; they simply <br>
                have the binaries installed & a SLURM config.<br>
                <br>
                I've never looked into having multiple databases &
                using <br>
                AccountingStorageExternalHost (in fact I'd forgotten you
                could do that), <br>
                so I can't comment on that (maybe someone else can); I
                think that works, <br>
                yes, but as I said, I never tested it (didn't see much
                point in running <br>
                multiple databases if one would do the job).<br>
                <br>
                I actually have specific login nodes for both of my
                clusters, to make it <br>
                easier for users (especially those with not much
                experience using the <br>
                HPC environment); so I have one login node connecting to
                cluster 1 and <br>
                one connecting to cluster 2.<br>
                <br>
                I think the relevant config entries (if I'm not
                mistaken) on the login nodes are probably these.<br>
                <br>
                The differences in the slurm config files (that haven't
                got to do with <br>
                topology & nodes & scheduler tuning) are:<br>
                <br>
                ClusterName=cluster1<br>
                ControlMachine=cluster1-slurm<br>
                ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
                <br>
                ClusterName=cluster2<br>
                ControlMachine=cluster2-slurm<br>
                ControlAddr=/IP_OF_SLURM_CONTROLLER/<br>
                <br>
                (where IP_OF_SLURM_CONTROLLER is the IP address of host
                cluster1-slurm, <br>
                same for cluster2)<br>
                <br>
                And then they have common entries for the
                AccountingStorageHost:<br>
                <br>
                AccountingStorageHost=slurm-db-prod<br>
                AccountingStorageBackupHost=slurm-db-prod<br>
                AccountingStoragePort=7030<br>
                AccountingStorageType=accounting_storage/slurmdbd<br>
                <br>
                (slurm-db-prod is simply the hostname of the SLURM
                database server)<br>
                <br>
                Does that help?<br>
                <br>
                Tina<br>
                <br>
                On 28/10/2021 14:59, navin srivastava wrote:<br>
                > Thank you Tina.<br>
                > <br>
                > So if I understood correctly, the database is
                global to both clusters and <br>
                > running on the login node?<br>
                > Or is the database running on one of the master
                nodes and shared with <br>
                > the other master node?<br>
                > <br>
                > But as far as I have read, the slurm database can
                also be separate on <br>
                > both masters, just using the parameter <br>
                > AccountingStorageExternalHost so that both
                databases are aware of each <br>
                > other.<br>
                > <br>
                > Also, which slurmctld does the slurm.conf file on
                the login node point to?<br>
                > Is it possible to share a sample slurm.conf file
                from a login node?<br>
                > <br>
                > Regards<br>
                > Navin.<br>
                > <br>
                > On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich <br>
                > <<a href="mailto:tina.friedrich@it.ox.ac.uk"
                  rel="noreferrer" target="_blank"
                  moz-do-not-send="true" class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>
                <mailto:<a href="mailto:tina.friedrich@it.ox.ac.uk"
                  rel="noreferrer" target="_blank"
                  moz-do-not-send="true" class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>>>
                wrote:<br>
                > <br>
                >     Hi Navin,<br>
                > <br>
                >     well, I have two clusters & login nodes
                that allow access to both. Will that<br>
                >     do? I don't think a third would make any
                difference in the setup.<br>
                > <br>
                >     They need to share a database. As long as they
                share a database, the<br>
                >     clusters have 'knowledge' of each other.<br>
                > <br>
                >     So if you set up one database server (running
                slurmdbd), and then a<br>
                >     SLURM controller for each cluster (running
                slurmctld) using that one<br>
                >     central database, the '-M' option should work.<br>
                > <br>
                >     Tina<br>
                > <br>
                >     On 28/10/2021 10:54, navin srivastava wrote:<br>
                >      > Hi,<br>
                >      ><br>
                >      > I am looking for a stepwise guide to set
                up a multi-cluster implementation.<br>
                >      > We wanted to set up 3 clusters and one
                login node to run jobs using the -M cluster option.<br>
                >      > Does anybody have such a setup and can
                share some insight into how it works, and whether it is
                really a stable solution?<br>
                >      ><br>
                >      ><br>
                >      > Regards<br>
                >      > Navin.<br>
                > <br>
                >     -- <br>
                >     Tina Friedrich, Advanced Research Computing Snr
                HPC Systems<br>
                >     Administrator<br>
                > <br>
                >     Research Computing and Support Services<br>
                >     IT Services, University of Oxford<br>
                >     <a href="http://www.arc.ox.ac.uk"
                  rel="noreferrer noreferrer" target="_blank"
                  moz-do-not-send="true" class="moz-txt-link-freetext">http://www.arc.ox.ac.uk</a>
                <<a href="http://www.arc.ox.ac.uk" rel="noreferrer
                  noreferrer" target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">http://www.arc.ox.ac.uk</a>><br>
                >     <a href="http://www.it.ox.ac.uk"
                  rel="noreferrer noreferrer" target="_blank"
                  moz-do-not-send="true" class="moz-txt-link-freetext">http://www.it.ox.ac.uk</a>
                <<a href="http://www.it.ox.ac.uk" rel="noreferrer
                  noreferrer" target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">http://www.it.ox.ac.uk</a>><br>
                > <br>
                <br>
                -- <br>
                Tina Friedrich, Advanced Research Computing Snr HPC
                Systems Administrator<br>
                <br>
                Research Computing and Support Services<br>
                IT Services, University of Oxford<br>
                <a href="http://www.arc.ox.ac.uk" rel="noreferrer
                  noreferrer" target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">http://www.arc.ox.ac.uk</a>
                <a href="http://www.it.ox.ac.uk" rel="noreferrer
                  noreferrer" target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">http://www.it.ox.ac.uk</a><br>
                <br>
              </blockquote>
            </div>
          </blockquote>
        </div>
        <br clear="all">
        <br>
        -- <br>
        <div dir="ltr" class="gmail_signature">
          <div dir="ltr">
            <div>
              <pre style="font-family:monospace">  <span style="color:rgb(133,12,27)">/|</span>       |
  <span style="color:rgb(133,12,27)">\/</span>       | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
  <span style="color:rgb(92,181,149)">[]</span>       | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
  <span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span>    | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
  <span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span>  | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
  <span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span>  <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span>  | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
  <span style="color:rgb(1,84,76)">//</span>    <span style="color:rgb(21,122,134)">\</span>  | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">irush@cs.huji.ac.il</a></span>
 <span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span>        |
</pre>
            </div>
          </div>
        </div>
      </div>
    </blockquote>
  </body>
</html>