That is interesting to me.

How do you use ulimit and systemd to limit user usage on the login
nodes? This sounds like something very useful.

Brian Andrus
<div class="moz-cite-prefix">On 10/31/2021 1:08 AM, Yair Yarom
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAAHNG4bVCr0LmFYTZdhEmuhGhvNv64pNtdKAtvu7+zzWXg9weQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr">
<div>Hi,</div>
<div><br>
</div>
> If it helps, this is our setup:
> 6 clusters (actually a bit more)
> 1 MySQL + slurmdbd on the same host
> 6 primary slurmctld daemons on 3 hosts (you need to make sure each
> has a distinct SlurmctldPort)
> 6 secondary slurmctld daemons, each on an arbitrary node of its own
> cluster
> 1 login node per cluster (this is a very small VM, and the users are
> limited both in CPU time (with ulimit) and memory (with systemd))
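> From memory, those limits look something like this (a sketch, not
> necessarily our exact config; paths and values are indicative only):
>
> # /etc/security/limits.d/99-login.conf - cap CPU time per process
> # (minutes), applied by pam_limits at login
> *   hard   cpu   60
>
> # /etc/systemd/system/user-.slice.d/99-memory.conf - cap total
> # memory per user slice
> [Slice]
> MemoryMax=2G
>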
> The slurm.conf files are shared over NFS to everyone, as
> /path/to/nfs/<cluster name>/slurm.conf, with a symlink in /etc to
> the relevant cluster's file on each node.
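> E.g. a node belonging to clusterA would have something like (exact
> /etc path depending on the installation):
>
> ln -s /path/to/nfs/clusterA/slurm.conf /etc/slurm/slurm.conf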
>
> The -M option generally works; we can submit and query jobs from the
> login node of one cluster to another. But there's a caveat to be
> aware of when upgrading: slurmdbd must be upgraded first, but we
> usually have a not-so-small gap between upgrading the different
> clusters. This causes -M to stop working, because binaries of one
> version won't work with the other (I don't remember in which
> direction).
> We solved this by using an lmod module per cluster, which sets both
> the SLURM_CONF environment variable and the PATH to the correct
> Slurm binaries (which we install in /usr/local/slurm/<version>/ so
> that they can coexist). So when -M won't work, users can do:
>
> module load slurm/clusterA
> squeue
> module load slurm/clusterB
> squeue
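>
> (Each module essentially just does the equivalent of, e.g. for
> clusterB:
>
> export SLURM_CONF=/path/to/nfs/clusterB/slurm.conf
> export PATH=/usr/local/slurm/<version>/bin:$PATH
>
> with <version> being whichever Slurm version that cluster runs.)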
>
> BR,
>
> On Thu, Oct 28, 2021 at 7:39 PM navin srivastava
> <navin.altair@gmail.com> wrote:
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div dir="auto">Thank you Tina.
<div dir="auto">It will really help</div>
<div dir="auto"><br>
</div>
<div dir="auto">Regards </div>
<div dir="auto">Navin </div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Oct 28, 2021,
22:01 Tina Friedrich <<a
href="mailto:tina.friedrich@it.ox.ac.uk"
target="_blank" moz-do-not-send="true"
class="moz-txt-link-freetext">tina.friedrich@it.ox.ac.uk</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">Hello,<br>
<br>
I have the database on a separate server (it runs the
database and the <br>
database only). The login nodes run nothing SLURM
related, they simply <br>
have the binaries installed & a SLURM config.<br>
<br>
I've never looked into having multiple databases &
using <br>
AccountingStorageExternalHost (in fact I'd forgotten you
could do that), <br>
so I can't comment on that (maybe someone else can); I
think that works, <br>
yes, but as I said never tested that (didn't see much
point in running <br>
multiple databases if one would do the job).<br>
<br>
I actually have specific login nodes for both of my
clusters, to make it <br>
easier for users (especially those with not much
experience using the <br>
HPC environment); so I have one login node connecting to
cluster 1 and <br>
one connecting to cluster 1.<br>
<br>
I think the relevant bits of slurm.conf Relevant config
entries (if I'm <br>
not mistaken) on the login nodes are probably:<br>
<br>
The differences in the slurm config files (that haven't
got to do with <br>
topology & nodes & scheduler tuning) are<br>
<br>
>>> ClusterName=cluster1
>>> ControlMachine=cluster1-slurm
>>> ControlAddr=IP_OF_SLURM_CONTROLLER
>>>
>>> ClusterName=cluster2
>>> ControlMachine=cluster2-slurm
>>> ControlAddr=IP_OF_SLURM_CONTROLLER
>>>
>>> (where IP_OF_SLURM_CONTROLLER is the IP address of host
>>> cluster1-slurm, same for cluster2)
>>>
>>> And then they have common entries for the AccountingStorageHost:
>>>
>>> AccountingStorageHost=slurm-db-prod
>>> AccountingStorageBackupHost=slurm-db-prod
>>> AccountingStoragePort=7030
>>> AccountingStorageType=accounting_storage/slurmdbd
>>>
>>> (slurm-db-prod is simply the hostname of the SLURM database
>>> server)
>>>
>>> Does that help?
>>>
>>> Tina
>>>
>>> On 28/10/2021 14:59, navin srivastava wrote:
>>>> Thank you Tina.
>>>>
>>>> So if I understood correctly: is the database global to both
>>>> clusters and running on the login node? Or is the database
>>>> running on one of the master nodes and shared with the other
>>>> master node?
>>>>
>>>> But as far as I have read, the slurm databases can also be kept
>>>> separate on both masters, just using the parameter
>>>> AccountingStorageExternalHost so that both databases are aware of
>>>> each other.
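>>>> From what I read, that would mean each cluster's slurm.conf
>>>> points at the other cluster's slurmdbd, with something like this
>>>> line (hostname made up; I have not tried this myself):
>>>>
>>>> AccountingStorageExternalHost=other-dbd-host:6819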
>>>>
>>>> Also, on the login node, which slurmctld does the slurm.conf file
>>>> point to? Is it possible to share a sample slurm.conf file from a
>>>> login node?
>>>>
>>>> Regards,
>>>> Navin.
>>>>
>>>> On Thu, Oct 28, 2021 at 7:06 PM Tina Friedrich
>>>> <tina.friedrich@it.ox.ac.uk> wrote:
>>>>> Hi Navin,
>>>>>
>>>>> Well, I have two clusters & login nodes that allow access to
>>>>> both. That do? I don't think a third would make any difference
>>>>> in setup.
>>>>>
>>>>> They need to share a database. As long as they share a database,
>>>>> the clusters have 'knowledge' of each other.
>>>>>
>>>>> So if you set up one database server (running slurmdbd), and
>>>>> then a SLURM controller for each cluster (running slurmctld)
>>>>> using that one central database, the '-M' option should work.
>>>>>
>>>>> Tina
>>>>>
>>>>> On 28/10/2021 10:54, navin srivastava wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am looking for a stepwise guide to set up a multi-cluster
>>>>>> implementation. We want to set up 3 clusters and one login node
>>>>>> to run jobs using the -M cluster option. Does anybody have such
>>>>>> a setup, and can you share some insight into how it works and
>>>>>> whether it is really a stable solution?
>>>>>>
>>>>>> Regards
>>>>>> Navin.
>>>>>
>>>>> --
>>>>> Tina Friedrich, Advanced Research Computing Snr HPC Systems
>>>>> Administrator
>>>>>
>>>>> Research Computing and Support Services
>>>>> IT Services, University of Oxford
>>>>> http://www.arc.ox.ac.uk
>>>>> http://www.it.ox.ac.uk
>>>
>>> --
>>> Tina Friedrich, Advanced Research Computing Snr HPC Systems
>>> Administrator
>>>
>>> Research Computing and Support Services
>>> IT Services, University of Oxford
>>> http://www.arc.ox.ac.uk
>>> http://www.it.ox.ac.uk
<br clear="all">
<br>
-- <br>
<div dir="ltr" class="gmail_signature">
<div dir="ltr">
<div>
<pre style="font-family:monospace"> <span style="color:rgb(133,12,27)">/|</span> |
<span style="color:rgb(133,12,27)">\/</span> | <span style="color:rgb(51,88,104);font-weight:bold">Yair Yarom </span><span style="color:rgb(51,88,104)">| System Group (DevOps)</span>
<span style="color:rgb(92,181,149)">[]</span> | <span style="color:rgb(51,88,104);font-weight:bold">The Rachel and Selim Benin School</span>
<span style="color:rgb(92,181,149)">[]</span> <span style="color:rgb(133,12,27)">/\</span> | <span style="color:rgb(51,88,104);font-weight:bold">of Computer Science and Engineering</span>
<span style="color:rgb(92,181,149)">[]</span><span style="color:rgb(0,161,146)">//</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(133,12,27)">\</span><span style="color:rgb(49,154,184)">/</span> | <span style="color:rgb(51,88,104)">The Hebrew University of Jerusalem</span>
<span style="color:rgb(92,181,149)">[</span><span style="color:rgb(1,84,76)">/</span><span style="color:rgb(0,161,146)">/</span> <span style="color:rgb(41,16,22)">\</span><span style="color:rgb(41,16,22)">\</span> | <span style="color:rgb(51,88,104)">T +972-2-5494522 | F +972-2-5494522</span>
<span style="color:rgb(1,84,76)">//</span> <span style="color:rgb(21,122,134)">\</span> | <span style="color:rgb(51,88,104)"><a href="mailto:irush@cs.huji.ac.il" target="_blank" moz-do-not-send="true" class="moz-txt-link-freetext">irush@cs.huji.ac.il</a></span>
<span style="color:rgb(127,130,103)">/</span><span style="color:rgb(1,84,76)">/</span> |
</pre>
</div>
</div>
</div>
</div>