<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>We don't do anything. In our environment it is the user's
responsibility to optimize their code appropriately. Since we
have a great variety of hardware, any modules we build (we have
several thousand of them) are all built generically. If people
want processor-specific optimizations, they have to build
their own stack.</p>
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 6/20/19 11:07 AM, Fulcomer, Samuel
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAOORAuEt589vMW=2R-mx-=xC9mhNEgiFrhrFT6FnNpZrz7JhfA@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">
<div dir="ltr">...ah, got it. I was confused by "PI/Lab nodes"
in your partition list.</div>
<div dir="ltr"><br>
</div>
<div>Our QoS/account pair for each investigator condo is our
approximate equivalent of what you're doing with owned
partitions. </div>
<div><br>
</div>
<div>Since we have everything in one partition, we segregate
processor types via topology.conf. We break up topology.conf
further to keep MPI jobs on the same switch.</div>
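<div><br>
</div>
<div>For illustration, a minimal sketch of that idea (the switch and
node names below are made up, not our real ones):</div>
<pre>
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf -- one leaf switch per node/processor group so MPI
# jobs stay on a single switch; a core switch ties the leaves together
SwitchName=ivy-sw1   Nodes=ivy[001-032]
SwitchName=bdw-sw1   Nodes=bdw[001-032]
SwitchName=csc-sw1   Nodes=csc[001-032]
SwitchName=core      Switches=ivy-sw1,bdw-sw1,csc-sw1
</pre>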
<div><br>
</div>
<div>On another topic, how do you address code optimization for
processor type? We've been mostly linking with MKL and relying
on its multi-code-path dispatch. </div>
<div><br>
</div>
<div>Regards,</div>
<div>Sam</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jun 20, 2019 at
10:20 AM Paul Edmon <<a
href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>People will specify which partition they need, or if
they want multiple, they use this:<br>
</p>
<p>#SBATCH -p general,shared,serial_requeue</p>
<p>The scheduler will then start the job in whichever of
those partitions can run it first. Naturally there is a risk
that you will end up running in a more expensive partition.</p>
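<p>For example (illustrative only; the resource requests here are
made up, the partition line is the real point), a job script might
look like:</p>
<pre>
#!/bin/bash
#SBATCH -p general,shared,serial_requeue  # scheduler starts the job wherever it can run first
#SBATCH -n 16                             # illustrative core count
#SBATCH -t 1-00:00:00                     # illustrative time limit
#SBATCH --mem-per-cpu=4G                  # illustrative memory request

srun ./my_app                             # placeholder application
</pre>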
<p>Our time limit is applied only to our public
partitions; our owned partitions (of which we have
roughly 80) have no time limit, so people running on
their dedicated resources pay no penalty. We've been
working on getting rid of owned partitions and moving to
school/department based partitions, where all the
purchased resources for the different PIs go into the
same bucket and they compete against each other rather
than against the wider community. We've found that this
works pretty well, as most PIs use their purchased
resources only sporadically. Thus there are usually idle
cores lying around that we backfill with our serial
queues. Since those jobs are requeueable, the owners
still get an immediate response when they need that
space back. We are also toying with a high priority
partition that is open to people with high fairshare so
that they can get an immediate response, as those with
high fairshare tend to be bursty users.</p>
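<p>Roughly speaking, that kind of requeue-based backfill can be
wired up in slurm.conf along these lines (the partition and node
names here are purely illustrative, not our actual config):</p>
<pre>
# slurm.conf
PreemptType=preempt/partition_prio

# Owned/department partition: higher tier, no time limit
PartitionName=dept_astro Nodes=holy[0001-0080] PriorityTier=2 MaxTime=UNLIMITED

# Backfill partition overlapping the same nodes: lower tier, its jobs
# are requeued as soon as an owner job needs the cores
PartitionName=serial_requeue Nodes=holy[0001-0080] PriorityTier=1 PreemptMode=REQUEUE MaxTime=7-00:00:00
</pre>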
<p>Our current half-life is set to a month and we keep 6
months of data in our database. I'd actually like to
get rid of the half-life and just go to a 3-month moving
window to allow people to bank their fairshare, but we
haven't done that yet, as people have been having a hard
enough time understanding our current system. That's not
due to its complexity; it's more that most people just
flat out aren't cognizant of their usage and think the
resource is functionally infinite.</p>
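<p>For reference, the relevant knobs in slurm.conf look roughly like
this (the weight is illustrative; note that PriorityUsageResetPeriod
is a periodic hard reset, not a true moving window):</p>
<pre>
# slurm.conf -- multifactor priority with a one-month usage half-life
PriorityType=priority/multifactor
PriorityDecayHalfLife=30-0          # 30 days
PriorityWeightFairshare=10000000    # illustrative weight

# The closest built-in alternative to a moving window is a hard reset:
# PriorityUsageResetPeriod=QUARTERLY
</pre>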
<p>-Paul Edmon-<br>
</p>
<div class="gmail-m_-2295429921239604436moz-cite-prefix">On
6/19/19 5:16 PM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">Hi Paul,
<div><br>
</div>
<div>Thanks. Your setup is interesting. I see that
you have your processor types segregated in their
own partitions (with the exception of the
requeue partition), and that's how you get at the
weighting mechanism. Do you have your users
explicitly specify multiple partitions in their
batch commands/scripts in order to take advantage
of this, or do you use a plugin for it?</div>
<div><br>
</div>
<div>It sounds like you don't impose any hard limit
on simultaneous resource use, and allow everything
to fairshare out with the help of the 7 day
TimeLimit. We haven't been imposing any TimeLimit
on our condo users, which would be an issue for us
with your config. For our exploratory and priority
users, we impose an effective time limit with
GrpTRESRunMins=cpu (and gres/gpu= for the GPU
usage). In addition, since we have so many
priority users, we don't explicitly set a rawshare
value for them (they all execute under the
"default" account). We set rawshare for the condo
accounts as cores-purchased/total-cores*1000. </div>
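<div><br>
</div>
<div>To make that concrete, a rough sacctmgr sketch (account names,
core counts, and limit values are all made up):</div>
<pre>
# Priority user: effective time limit via GrpTRESRunMins, in cpu-minutes
# (plus a GPU-minutes cap); numbers are illustrative
sacctmgr add qos priority_jdoe set GrpTRES=cpu=192 \
    GrpTRESRunMins=cpu=1000000,gres/gpu=20000

# Condo account rawshare = cores-purchased / total-cores * 1000,
# e.g. 512 purchased cores out of 16000 total: 512/16000*1000 = 32
sacctmgr modify account condo_smith set fairshare=32
</pre>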
<div><br>
</div>
<div>What's your fairshare decay setting (don't
remember the proper name at the moment)?</div>
<div><br>
</div>
<div>Regards,</div>
<div>Sam</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jun 19,
2019 at 3:44 PM Paul Edmon <<a
href="mailto:pedmon@cfa.harvard.edu"
target="_blank" moz-do-not-send="true">pedmon@cfa.harvard.edu</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px
0px 0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF">
<p>We do a similar thing here at Harvard:</p>
<p><a
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-txt-link-freetext"
href="https://www.rc.fas.harvard.edu/fairshare/" target="_blank"
moz-do-not-send="true">https://www.rc.fas.harvard.edu/fairshare/</a></p>
<p>We simply weight all the partitions based on
their core type and then we allocate Shares
for each account based on what they have
purchased. We don't use QoS at all, so we
just rely purely on fairshare weighting for
resource usage. It has worked pretty well for
our purposes.</p>
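<p>A rough sketch of how that kind of per-core-type weighting and
per-account Shares can be expressed (partition names, weights, and
share values are illustrative; TRESBillingWeights is one common way
to do the weighting):</p>
<pre>
# slurm.conf -- charge newer cores more heavily
PartitionName=ivybridge Nodes=ivy[001-100] TRESBillingWeights="CPU=1.0"
PartitionName=broadwell Nodes=bdw[001-100] TRESBillingWeights="CPU=1.5"
PartitionName=cascade   Nodes=csc[001-100] TRESBillingWeights="CPU=2.0"

# Shares per account, sized to what each group purchased (illustrative)
sacctmgr modify account lab_alpha set fairshare=40
sacctmgr modify account lab_beta  set fairshare=15
</pre>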
<p>-Paul Edmon-<br>
</p>
<div
class="gmail-m_-2295429921239604436gmail-m_8457408054565706666moz-cite-prefix">On
6/19/19 3:30 PM, Fulcomer, Samuel wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr"><br>
<div>(...and yes, the name is inspired by a
certain OEM's software licensing
schemes...)</div>
<div><br>
</div>
<div>At Brown we run a ~400 node cluster
containing nodes of multiple architectures
(Sandy/Ivy, Haswell/Broadwell, and
Sky/Cascade) purchased in some cases by
University funds and in others by
investigator funding (~50:50). They all
appear in the default SLURM partition. We
have 3 classes of SLURM users:</div>
<div><br>
</div>
<div>
<ol>
<li>Exploratory - no-charge access to up
to 16 cores</li>
<li>Priority - $750/quarter for access
to up to 192 cores (and with a
GrpTRESRunMins=cpu limit). Each user
has their own QoS</li>
<li>Condo - an investigator group who
paid for nodes added to the cluster.
The group has its own QoS and SLURM
Account. The QoS allows use of the
number of cores purchased and has a
much higher priority than the QoSes of
the "priority" users (a rough sketch
follows this list).</li>
</ol>
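<div>For concreteness, a rough sacctmgr sketch of how these three
classes can be encoded (all names, priorities, and limits below are
illustrative):</div>
<pre>
# 1. Exploratory: no charge, capped at 16 cores
sacctmgr add qos exploratory set GrpTRES=cpu=16 Priority=10

# 2. Priority: per-user QoS with a 192-core cap and a cpu-minutes run limit
sacctmgr add qos priority_jdoe set GrpTRES=cpu=192 \
    GrpTRESRunMins=cpu=500000 Priority=100

# 3. Condo: per-group account and QoS sized to the cores purchased,
#    with much higher priority than the "priority" QoSes
sacctmgr add account condo_smith
sacctmgr add qos condo_smith set GrpTRES=cpu=256 Priority=1000
</pre>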
<div>The first problem with this scheme is
that condo users who have purchased the
older hardware now have access to the
newest without penalty. In addition,
we're encountering resistance to the
idea of turning off their hardware and
terminating their condos (despite MOUs
stating a 5yr life). The pushback is the
stated belief that the hardware should
run until it dies.</div>
</div>
<div><br>
</div>
<div>What I propose is a new TRES called a
Processor Performance Unit (PPU) that
would be specified on the Node line in
slurm.conf, and used such that
GrpTRES=ppu=N was calculated as the number
of allocated cores multiplied by their
associated PPU numbers.</div>
<div><br>
</div>
<div>We could then assign a base PPU to the
oldest hardware, say "1" for Sandy/Ivy,
and increase it for later architectures based
on the performance improvement. We'd set the
condo QoS to GrpTRES=ppu=N*X+M*Y,...,
where N is the number of cores of the
oldest architecture, X is its configured
PPU per core, and M*Y (and so on) repeats
the pattern for any newer nodes/cores the
investigator has purchased since.</div>
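<div><br>
</div>
<div>A purely hypothetical sketch of what that could look like (none
of this syntax exists in Slurm today, and all numbers are made up):</div>
<pre>
# Hypothetical slurm.conf -- PPU declared per node type
NodeName=sandy[001-008]   CPUs=16 RealMemory=64000  PPU=1.0
NodeName=cascade[001-004] CPUs=32 RealMemory=192000 PPU=1.8

# A condo that bought 128 Sandy/Ivy cores and 64 Cascade cores:
#   128*1.0 + 64*1.8 = 243.2  =>  GrpTRES=ppu=243
sacctmgr modify qos condo_smith set GrpTRES=ppu=243
</pre>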
<div><br>
</div>
<div>The result is that the investigator
group gets to run on an approximation of
the performance that they've purchased,
rather than on the raw purchased core count.</div>
<div><br>
</div>
<div>Thoughts?</div>
<div><br>
</div>
<div><br>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>