<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Yes, QoSs are dynamic, so the QoS of a pending job can be changed.</p>
<p>-Paul Edmon-<br>
</p>
<div class="moz-cite-prefix">On 8/30/19 2:58 PM, Guillaume Perrault
Archambault wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAG1OYp2cwXKLnrzmauqMgWg2+e3f4WPz1cQ2jQJeZidcPaE96g@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Hi Paul,
<div><br>
</div>
<div>Thanks for your pointers.<br>
<div><br>
</div>
<div>I'll look into QOS and MCS after my paper deadline
(Sept 5). Re QOS, as expressed to Peter in the reply I just
now sent, I wonder if the QOS of a job can be changed
while it's pending (submitted but not yet running).</div>
<div><br>
</div>
<div>Regards,</div>
</div>
<div>Guillaume.</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Aug 30, 2019 at 10:24
AM Paul Edmon &lt;<a href="mailto:pedmon@cfa.harvard.edu"
moz-do-not-send="true">pedmon@cfa.harvard.edu</a>&gt; wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">A
QoS is probably your best bet. Another variant might be MCS,
which <br>
you can use to help reduce resource fragmentation. For limits
though <br>
QoS will be your best bet.<br>
<br>
-Paul Edmon-<br>
<br>
On 8/30/19 7:33 AM, Steven Dick wrote:<br>
> It would still be possible to use job arrays in this
situation, it's<br>
> just slightly messy.<br>
> So the way a job array works is that you submit a single
script, and<br>
> that script is provided an integer for each subjob. The
integer is in<br>
> a range, with a possible step (default=1).<br>
><br>
> To run the situation you describe, you would have to
predetermine how<br>
> many of each test you want to run (i.e., you couldn't
dynamically<br>
> change the number of jobs that run within one array),
and a master<br>
> script would map the integer range to the job that was to
be started.<br>
><br>
> The most trivial way to do it would be to put the list of
regressions<br>
> in a text file and the master script would index it by
line number and<br>
> then run the appropriate command.<br>
> A more complex way would be to do some math (a divide?)
to get the<br>
> script name and subindex (modulus?) for each regression.<br>
><br>
> Both of these would require some semi-advanced scripting,
but nothing<br>
> that couldn't be cut and pasted with some trivial
modifications for<br>
> each job set.<br>
><br>
> As to the unavailability of the admin ...<br>
> An alternate approach that would require the admin's help
would be to<br>
> come up with a small set of allocations (e.g., 40 gpus, 80
gpus, 100<br>
> gpus, etc.) and make a QOS for each one with a gpu limit
(e.g.,<br>
> maxtrespu=gres/gpu=40). Then the user would assign that QOS to
the job when<br>
> starting it to set the overall allocation for all the
jobs. The admin<br>
> wouldn't need to tweak this except once; you just pick
which tweak to<br>
> use.<br>
><br>
> On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault
Archambault<br>
> &lt;<a href="mailto:gperr050@uottawa.ca" target="_blank"
moz-do-not-send="true">gperr050@uottawa.ca</a>&gt; wrote:<br>
>> Hi Steven,<br>
>><br>
>> Thanks for taking the time to reply to my post.<br>
>><br>
>> Setting a limit on the number of jobs for a single
array isn't sufficient because regression-tests need to launch
multiple arrays, and I would need a job limit that would take
effect over all launched jobs.<br>
>><br>
>> It's very possible I'm not understanding something. I'll
lay out a very specific example in the hopes you can correct
me if I've gone wrong somewhere.<br>
>><br>
>> Let's take the small cluster with 140 GPUs and no
fairshare as an example, because it's easier for me to
explain.<br>
>><br>
>> The users, who all know each other personally and
interact via chat, decide on a daily basis how many jobs each
user can run at a time.<br>
>><br>
>> Let's say today is Sunday (hypothetically). Nobody is
actively developing today, except that user 1 has 10 jobs
running for the entire weekend. That leaves 130 GPUs unused.<br>
>><br>
>> User 2, whose jobs all run on 1 GPU, decides to run a
regression test. The regression test comprises 9 different
scripts, each run 40 times, for a grand total of 360 jobs. The
duration of the scripts varies from 1 to 5 hours,
and the jobs take on average 4 hours to complete.<br>
>><br>
>> User 2 gets the user group's approval (via chat) to
use 90 GPUs (so that 40 GPUs will remain for anyone else
wanting to work that day).<br>
>><br>
>> The problem I'm trying to solve is this: how do I
ensure that user 2 launches his 360 jobs in such a way that 90
jobs are in the run state consistently until the regression
test is finished?<br>
>><br>
>> Keep in mind that:<br>
>><br>
>> - limiting each job array to 10 jobs is inefficient:
when the first job array finishes (long before the last one),
only 80 GPUs will be used, and so on as other arrays finish<br>
>> - the admin is not available, so he cannot be asked to set
a hard limit of 90 jobs for user 2 just for today<br>
>><br>
>> I would be happy to use job arrays if they allow me
to set an overarching job limit across multiple arrays.
Perhaps this is doable. Admittedly I'm working on a paper to be
submitted in a few days, so I don't have time to test job
arrays thoroughly, but I will try them out more
thoroughly once I've submitted my paper (i.e., after Sept 5).<br>
>><br>
>> My solution, for now, is to not use job arrays.
Instead, I launch each job individually, and I use singleton
(by launching all jobs with the same 90 unique names) to
ensure that exactly 90 jobs are run at a time (in this case,
corresponding to 90 GPUs in use).<br>
>><br>
>> Side note: the unavailability of the admin might
sound contrived by picking Sunday as an example, but it's in
fact very typical. The admin is not available:<br>
>><br>
>> - on weekends (the present example)<br>
>> - at any time outside of 9am to 5pm (keep in mind, this
is a cluster used by students in different time zones)<br>
>> - any time he is on vacation<br>
>> - any time he is looking after his many other
responsibilities. Constantly setting user limits that change
on a daily basis would be too much to ask.<br>
>><br>
>><br>
>> I'd be happy if you corrected my misunderstandings,
especially if you could show me how to set a job limit that
takes effect over multiple job arrays.<br>
>><br>
>> I may have very glaring oversights as I don't
necessarily have a big picture view of things (I've never been
an admin, most notably), so feel free to poke holes in the way
I've constructed things.<br>
>><br>
>> Regards,<br>
>> Guillaume.<br>
>><br>
>><br>
>> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick &lt;<a
href="mailto:kg4ydw@gmail.com" target="_blank"
moz-do-not-send="true">kg4ydw@gmail.com</a>&gt; wrote:<br>
>>> This makes no sense and seems backwards to me.<br>
>>><br>
>>> When you submit an array job, you can specify how
many jobs from the<br>
>>> array you want to run at once.<br>
>>> So, an administrator can create a QOS that
explicitly limits the user.<br>
>>> However, you keep saying that they probably won't
modify the system<br>
>>> for just you...<br>
>>><br>
>>> That seems to me to be the perfect case to use
array jobs and tell it<br>
>>> how many elements of the array to run at once.<br>
>>> You're not using array jobs for exactly the wrong
reason.<br>
>>><br>
>>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume
Perrault Archambault<br>
>>> &lt;<a href="mailto:gperr050@uottawa.ca"
target="_blank" moz-do-not-send="true">gperr050@uottawa.ca</a>&gt;
wrote:<br>
>>>> The reason I don't use job arrays is to be
able to limit the number of jobs per user<br>
<br>
</blockquote>
</div>
</blockquote>
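<p>The array-index mapping Steven describes in the quoted thread (a
text file indexed by line number, or a divide and modulus) can be
sketched roughly as follows. This is only a minimal Python sketch, not
something posted in the thread: the script names, RUNS_PER_SCRIPT, and
the helper functions are hypothetical, and it assumes the array is
numbered from 0 and that the task index comes from Slurm's
SLURM_ARRAY_TASK_ID environment variable.</p>
<pre>
```python
import os

# Hypothetical regression scripts; the names are illustrative only.
SCRIPTS = [f"regression_{n}.sh" for n in range(1, 10)]  # 9 scripts
RUNS_PER_SCRIPT = 40  # each script is run 40 times, 360 tasks in total


def current_task_id(default=0):
    """Read the array task index that Slurm exports to each array task."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def task_to_job(task_id, runs_per_script=RUNS_PER_SCRIPT):
    """Divide/modulus variant: map a 0-based task ID to (script, run index)."""
    script_index, run_index = divmod(task_id, runs_per_script)
    return SCRIPTS[script_index], run_index


def command_for_line(lines, task_id):
    """Text-file variant: return the command stored on line task_id (0-based)."""
    return lines[task_id].strip()


if __name__ == "__main__":
    script, run = task_to_job(current_task_id())
    print(f"would run {script} (run {run})")
```
</pre>
<p>If the 9 x 40 runs from Guillaume's example were submitted as one
array, for instance with sbatch --array=0-359%90 and a wrapper script
along these lines, the %90 suffix asks Slurm to run at most 90 tasks of
that array at once, giving one overall limit instead of one per
array.</p>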
</body>
</html>