<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>Yes, QoS's are dynamic.</p>
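    <p>As a minimal sketch of what this allows, assuming a QoS with the hypothetical name <code>gpu40</code> already exists and the user's association is permitted to use it, a pending job can be moved to that QoS with <code>scontrol update</code>. The Python helper below only builds the command line; the job ID used in the comment is a placeholder.</p>

    <pre>
```python
import subprocess


def qos_update_command(job_id, qos_name):
    """Build the scontrol call that moves a job to another QoS."""
    return ["scontrol", "update", f"JobId={job_id}", f"QOS={qos_name}"]


def change_qos(job_id, qos_name):
    """Run the command; this needs a working Slurm installation."""
    subprocess.run(qos_update_command(job_id, qos_name), check=True)


# Example (placeholder job ID and hypothetical QoS name):
# change_qos(12345, "gpu40")
```
    </pre>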

    <p>-Paul Edmon-<br>

    </p>

    <div class="moz-cite-prefix">On 8/30/19 2:58 PM, Guillaume Perrault

      Archambault wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAG1OYp2cwXKLnrzmauqMgWg2+e3f4WPz1cQ2jQJeZidcPaE96g@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">Hi Paul,

        <div><br>

        </div>

        <div>Thanks for your pointers.<br>

          <div><br>

          </div>

          <div>I'll look into QOS and MCS after my paper deadline

            (Sept 5). Re QOS, as expressed to Peter in the reply I just

            now sent, I wonder if the QOS of a job can be changed

            while it's pending (submitted but not yet running).</div>

          <div><br>

          </div>

          <div>Regards,</div>

        </div>

        <div>Guillaume.</div>

      </div>

      <br>

      <div class="gmail_quote">

        <div dir="ltr" class="gmail_attr">On Fri, Aug 30, 2019 at 10:24

          AM Paul Edmon <<a href="mailto:pedmon@cfa.harvard.edu"

            moz-do-not-send="true">pedmon@cfa.harvard.edu</a>> wrote:<br>

        </div>

        <blockquote class="gmail_quote" style="margin:0px 0px 0px

          0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">A

          QoS is probably your best bet.  Another variant might be MCS,

          which <br>

          you can use to help reduce resource fragmentation.  For limits

          though <br>

          QoS will be your best bet.<br>

          <br>

          -Paul Edmon-<br>

          <br>

          On 8/30/19 7:33 AM, Steven Dick wrote:<br>

          > It would still be possible to use job arrays in this

          situation, it's<br>

          > just slightly messy.<br>

          > So the way a job array works is that you submit a single

          script, and<br>

          > that script is provided an integer for each subjob.  The

          integer is in<br>

          > a range, with a possible step (default=1).<br>

          ><br>

          > To run the situation you describe, you would have to

          predetermine how<br>

          > many of each test you want to run (i.e., you couldn't

          dynamically<br>

          > change the number of jobs that run within one array),

          and a master<br>

          > script would map the integer range to the job that was to

          be started.<br>

          ><br>

          > The most trivial way to do it would be to put the list of

          regressions<br>

          > in a text file and the master script would index it by

          line number and<br>

          > then run the appropriate command.<br>

          > A more complex way would be to do some math (a divide?)

          to get the<br>

          > script name and subindex (modulus?) for each regression.<br>
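          A minimal sketch of such a master script, written in Python and assuming the 9 scripts and 40 repetitions from the example further down the thread. The script names are hypothetical; <code>SLURM_ARRAY_TASK_ID</code> is the integer Slurm passes to each array task. Both variants are shown: a lookup by line number and the divide/modulus mapping.<br>

          <pre>
```python
import os

# Hypothetical script names; replace with the real regression scripts.
SCRIPTS = [f"test_{n}.sh" for n in range(9)]
REPEATS = 40


def task_to_job(task_id, scripts=SCRIPTS, repeats=REPEATS):
    """Map an array index to (script name, repetition number)."""
    if task_id not in range(len(scripts) * repeats):
        raise ValueError(f"task id {task_id} is outside the array range")
    script_index, repetition = divmod(task_id, repeats)
    return scripts[script_index], repetition


def command_from_lines(task_id, lines):
    """Simpler variant: one command per line, indexed by task id."""
    return lines[task_id].strip()


if __name__ == "__main__":
    # Slurm sets this variable inside each array task; default to 0
    # so the script can also be tried outside of a job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    script, repetition = task_to_job(task_id)
    print(script, repetition)
```
          </pre>

          With this mapping, a single array with indices 0 to 359 covers all 360 jobs.<br>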

          ><br>

          > Both of these would require some semi-advanced scripting,

          but nothing<br>

          > that couldn't be cut and pasted with some trivial

          modifications for<br>

          > each job set.<br>

          ><br>

          > As to the unavailability of the admin ...<br>

          > An alternate approach that would require the admin's help

          would be to<br>

          > come up with a small set of allocations (e.g., 40 gpus, 80

          gpus, 100<br>

          > gpus, etc.) and make a QOS for each one with a gpu limit

          (e.g.,<br>

          > MaxTRESPerUser=gres/gpu=40 ) Then the user would assign that QOS to

          the job when<br>

          > starting it to set the overall allocation for all the

          jobs.  The admin<br>

          > wouldn't need to tweak this except once, you just pick

          which tweak to<br>

          > use.<br>
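          A minimal sketch of the one-time admin setup, assuming GPUs are tracked as the <code>gres/gpu</code> TRES and using hypothetical QoS names of the form <code>gpu40</code>. The Python helper only generates the <code>sacctmgr</code> commands; an admin would still have to run them and grant the QoS to the users' associations.<br>

          <pre>
```python
def qos_setup_commands(gpu_limits):
    """Return sacctmgr commands that create one QoS per GPU limit."""
    commands = []
    for limit in gpu_limits:
        name = f"gpu{limit}"  # hypothetical naming scheme
        commands.append(f"sacctmgr -i add qos {name}")
        commands.append(
            f"sacctmgr -i modify qos {name} set MaxTRESPerUser=gres/gpu={limit}"
        )
    return commands


def submit_command(script, gpu_limit):
    """Return the sbatch call a user would issue to pick one of the QoS."""
    return ["sbatch", f"--qos=gpu{gpu_limit}", script]


if __name__ == "__main__":
    for line in qos_setup_commands([40, 80, 100]):
        print(line)
```
          </pre>

          The user then chooses the allocation at submission time, for example with <code>submit_command("run_test.sh", 80)</code>, where the script name is a placeholder.<br>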

          ><br>

          > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault

          Archambault<br>

          > <<a href="mailto:gperr050@uottawa.ca" target="_blank"

            moz-do-not-send="true">gperr050@uottawa.ca</a>> wrote:<br>

          >> Hi Steven,<br>

          >><br>

          >> Thanks for taking the time to reply to my post.<br>

          >><br>

          >> Setting a limit on the number of jobs for a single

          array isn't sufficient because regression-tests need to launch

          multiple arrays, and I would need a job limit that would take

          effect over all launched jobs.<br>

          >><br>

          >> It's very possible I'm not understanding something. I'll

          lay out a very specific example in the hopes you can correct

          me if I've gone wrong somewhere.<br>

          >><br>

          >> Let's take the small cluster with 140 GPUs and no

          fairshare as an example, because it's easier for me to

          explain.<br>

          >><br>

          >> The users, who all know each other personally and

          interact via chat, decide on a daily basis how many jobs each

          user can run at a time.<br>

          >><br>

          >> Let's say today is Sunday (hypothetically). Nobody is

          actively developing today, except that user 1 has 10 jobs

          running for the entire weekend. That leaves 130 GPUs unused.<br>

          >><br>

          >> User 2, whose jobs all run on 1 GPU, decides to run a

          regression test. The regression test comprises 9 different

          scripts, each run 40 times, for a grand total of 360 jobs. The

          scripts take between 1 and 5 hours to complete, and the jobs

          take on average 4 hours.<br>

          >><br>

          >> User 2 gets the user group's approval (via chat) to

          use 90 GPUs (so that 40 GPUs will remain for anyone else

          wanting to work that day).<br>

          >><br>

          >> The problem I'm trying to solve is this: how do I

          ensure that user 2 launches his 360 jobs in such a way that 90

          jobs are in the run state consistently until the regression

          test is finished?<br>

          >><br>

          >> Keep in mind that:<br>

          >><br>

          >> limiting each job array to 10 jobs is inefficient:

          when the first job array finishes (long before the last one),

          only 80 GPUs will be used, and so on as other arrays finish<br>

          >> the admin is not available, so he cannot be asked to set

          a hard limit of 90 jobs for user 2 just for today<br>

          >><br>

          >> I would be happy to use job arrays if they allow me

          to set an overarching job limit across multiple arrays.

          Perhaps this is doable. Admittedly, I'm working on a paper to be

          submitted in a few days, so I don't have time to test job

          arrays thoroughly, but I will try them out properly

          once I've submitted my paper (i.e., after Sept 5).<br>

          >><br>

          >> My solution, for now, is to not use job arrays.

          Instead, I launch each job individually, and I use singleton

          (by reusing the same 90 unique job names across all jobs) to

          ensure that exactly 90 jobs are run at a time (in this case,

          corresponding to 90 GPUs in use).<br>
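          A minimal sketch of this singleton approach, assuming hypothetical script names and slot names of the form <code>slot00</code> to <code>slot89</code>. With <code>--dependency=singleton</code>, Slurm runs only one job per job name and user at a time, so cycling through 90 names caps the number of running jobs at 90. The Python helper only builds the <code>sbatch</code> commands.<br>

          <pre>
```python
from itertools import cycle


def singleton_submissions(scripts, repeats, slots):
    """Build one sbatch command per job, cycling through slot names."""
    names = cycle([f"slot{n:02d}" for n in range(slots)])
    commands = []
    for script in scripts:
        for _ in range(repeats):
            commands.append(
                [
                    "sbatch",
                    f"--job-name={next(names)}",
                    "--dependency=singleton",
                    script,
                ]
            )
    return commands


if __name__ == "__main__":
    # Hypothetical script names: 9 scripts, 40 runs each, 90 slots.
    scripts = [f"test_{n}.sh" for n in range(9)]
    commands = singleton_submissions(scripts, repeats=40, slots=90)
    print(len(commands), "jobs to submit")
```
          </pre>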

          >><br>

          >> Side note: the unavailability of the admin might

          sound contrived by picking Sunday as an example, but it's in

          fact very typical. The admin is not available:<br>

          >><br>

          >> on weekends (the present example)<br>

          >> at any time outside of 9am to 5pm (keep in mind, this

          is a cluster used by students in different time zones)<br>

          >> any time he is on vacation<br>

          >> any time he is looking after his many other

          responsibilities. Constantly setting user limits that change

          on a daily basis would be too much to ask.<br>

          >><br>

          >><br>

          >> I'd be happy if you corrected my misunderstandings,

          especially if you could show me how to set a job limit that

          takes effect over multiple job arrays.<br>

          >><br>

          >> I may have very glaring oversights as I don't

          necessarily have a big picture view of things (I've never been

          an admin, most notably), so feel free to poke holes in the way

          I've constructed things.<br>

          >><br>

          >> Regards,<br>

          >> Guillaume.<br>

          >><br>

          >><br>

          >> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick <<a

            href="mailto:kg4ydw@gmail.com" target="_blank"

            moz-do-not-send="true">kg4ydw@gmail.com</a>> wrote:<br>

          >>> This makes no sense and seems backwards to me.<br>

          >>><br>

          >>> When you submit an array job, you can specify how

          many jobs from the<br>

          >>> array you want to run at once.<br>

          >>> So, an administrator can create a QOS that

          explicitly limits the user.<br>

          >>> However, you keep saying that they probably won't

          modify the system<br>

          >>> for just you...<br>

          >>><br>

          >>> That seems to me to be the perfect case to use

          array jobs and tell it<br>

          >>> how many elements of the array to run at once.<br>

          >>> You're not using array jobs for exactly the wrong

          reason.<br>
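          A minimal sketch of the per-array limit described here: <code>sbatch</code> accepts a <code>%</code> separator in <code>--array</code> that caps how many tasks of that array run at once. The Python helper and the script name are hypothetical and only build the command.<br>

          <pre>
```python
def array_spec(task_count, max_running):
    """Return an --array option that throttles simultaneous tasks."""
    if task_count not in range(1, 10**6) or max_running not in range(1, 10**6):
        raise ValueError("task_count and max_running must be positive")
    return f"--array=0-{task_count - 1}%{max_running}"


def array_submission(script, task_count, max_running):
    """Return the full sbatch command for a throttled job array."""
    return ["sbatch", array_spec(task_count, max_running), script]


if __name__ == "__main__":
    # One array of 360 tasks with at most 90 running at a time.
    print(" ".join(array_submission("master.sh", 360, 90)))
```
          </pre>

          The limit applies to one array only, so covering several test scripts with a single limit means submitting them as one array.<br>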

          >>><br>

          >>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume

          Perrault Archambault<br>

          >>> <<a href="mailto:gperr050@uottawa.ca"

            target="_blank" moz-do-not-send="true">gperr050@uottawa.ca</a>>

          wrote:<br>

          >>>> The reason I don't use job arrays is to be

          able to limit the number of jobs per user<br>

          <br>

        </blockquote>

      </div>

    </blockquote>

  </body>

</html>