<div dir="ltr">Hi Steven,<div><br></div><div>Those both sound like potentially good solutions.</div><div><br></div><div>So basically, you're saying that if I script it properly, I can use a single job array to launch multiple scripts by using a master sbatch script.</div><div><br></div><div>My problem with that, though, is: what if the scripts (the 9 scripts in my earlier example) each have different requirements? For example, running on a different partition, or setting a different time limit? My understanding is that for a single job array, each job will get the same job requirements.</div><div><br></div><div>The other problem is that the way I've implemented it lets me change the max jobs dynamically, which I'm not sure a job array allows.</div><div><br></div><div>I'll illustrate this using my earlier example. Suppose user 2 launches his 360 jobs with a 90 job limit (leaving 40 unused GPUs), and then user 3 realizes he needs to use 45 GPUs.</div><div><br></div><div>User 2 decides to drop his usage to 45 max jobs.</div><div><br></div><div>He can simply rename his pending singleton jobs so that they share 45 unique names, which reduces his max jobs from 90 to 45. (I wrote a script to do that, so it's a one-liner for user 2.)</div><div><br></div><div>Can the max job limit be modified after submission time using one big job array?</div><br>The docs give the '%' separator to limit the number of concurrent jobs ("--array=0-15%4"). I could be wrong, but this sounds like a submit-time-only option that cannot be changed after submission.<div><br></div><div>I also kind of like the various QOS for different job limits. I'm not sure I'll be able to get the admin on board, but I'll bring it up. 
Even if I do get them on board, will I have the same problem of locking the max limit at submit time?</div><div><br></div><div>Can you change the QOS of a job when it's still pending?</div><div><br></div><div>Thanks a lot for your help!</div><div><br></div><div>Regards,</div><div>Guillaume</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Aug 30, 2019 at 7:36 AM Steven Dick <<a href="mailto:kg4ydw@gmail.com">kg4ydw@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">It would still be possible to use job arrays in this situation, it's<br>

just slightly messy.<br>

So the way a job array works is that you submit a single script, and<br>

that script is provided an integer for each subjob.  The integer is in<br>

a range, with a possible step (default=1).<br>

<br>

To run the situation you describe, you would have to predetermine how<br>

many of each test you want to run (i.e., you couldn't dynamically<br>

change the number of jobs that run within one array), and a master<br>

script would map the integer range to the job that was to be started.<br>

<br>

The most trivial way to do it would be to put the list of regressions<br>

in a text file and the master script would index it by line number and<br>

then run the appropriate command.<br>
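A minimal sketch of the text-file approach, assuming the master script is itself written in Python and that each non-blank line of a list file (here called regressions.txt, a hypothetical name) holds one complete command. The helper name command_for_task is made up for illustration; SLURM_ARRAY_TASK_ID is the environment variable Slurm sets for each array task.<br>

```python
#!/usr/bin/env python3
"""Master script for a job array: run the command found on line
SLURM_ARRAY_TASK_ID of a list file."""
import os
import shlex
import subprocess
import sys


def command_for_task(lines, task_id):
    """Return the command at 0-based index task_id, ignoring blank lines
    and lines starting with '#'."""
    commands = [ln.strip() for ln in lines
                if ln.strip() and not ln.lstrip().startswith("#")]
    if not 0 <= task_id < len(commands):
        raise IndexError(f"task id {task_id} is outside the command list")
    return commands[task_id]


def main():
    # Hypothetical default file name; pass another one as the first argument.
    list_file = sys.argv[1] if len(sys.argv) > 1 else "regressions.txt"
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    with open(list_file) as fh:
        command = command_for_task(fh, task_id)
    # Run the selected command and propagate its exit status to Slurm.
    sys.exit(subprocess.call(shlex.split(command)))


# Only act when Slurm has actually provided an array index.
if __name__ == "__main__" and "SLURM_ARRAY_TASK_ID" in os.environ:
    main()
```

With 360 lines in the list file, something like "sbatch --array=0-359%90 master.py regressions.txt" would then start one array task per line (master.py being whatever name the script above is saved under).<br>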

A more complex way would be to do some math (a divide?) to get the<br>

script name and subindex (modulus?) for each regression.<br>
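The divide/modulus idea could look roughly like this, assuming every script is run the same number of times (9 scripts x 40 runs in the example from this thread). The function name split_task_id and the script names are hypothetical.<br>

```python
def split_task_id(task_id, scripts, runs_per_script):
    """Map a flat array index to (script, run index within that script).

    The integer quotient selects the script and the remainder selects
    which repetition of that script this task is.
    """
    if task_id < 0:
        raise IndexError("task id must be non-negative")
    script_index, run_index = divmod(task_id, runs_per_script)
    if script_index >= len(scripts):
        raise IndexError(f"task id {task_id} is past the last script")
    return scripts[script_index], run_index


# Hypothetical script names, standing in for the 9 regression scripts.
scripts = [f"regression_{i}.sh" for i in range(9)]

# Array indices 0..359 cover 9 scripts x 40 runs.
first = split_task_id(0, scripts, 40)     # first run of the first script
last = split_task_id(359, scripts, 40)    # last run of the last script
```

The master script would read the array index from SLURM_ARRAY_TASK_ID, call this function, and then launch the returned script with the run index as its argument.<br>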

<br>

Both of these would require some semi-advanced scripting, but nothing<br>

that couldn't be cut and pasted with some trivial modifications for<br>

each job set.<br>

<br>

As to the unavailability of the admin ...<br>

An alternate approach that would require the admin's help would be to<br>

come up with a small set of allocations (e.g., 40 gpus, 80 gpus, 100<br>

gpus, etc.) and make a QOS for each one with a gpu limit (e.g.,<br>

maxtrespu=gres/gpu=40). Then the user would assign that QOS to the job when<br>

starting it to set the overall allocation for all the jobs.  The admin<br>

wouldn't need to tweak this more than once; you just pick which tweak to<br>

use.<br>

<br>

On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault<br>

<<a href="mailto:gperr050@uottawa.ca" target="_blank">gperr050@uottawa.ca</a>> wrote:<br>

><br>

> Hi Steven,<br>

><br>

> Thanks for taking the time to reply to my post.<br>

><br>

> Setting a limit on the number of jobs for a single array isn't sufficient because regression-tests need to launch multiple arrays, and I would need a job limit that would take effect over all launched jobs.<br>

><br>

> It's very possible I'm not understanding something. I'll lay out a very specific example in the hopes you can correct me if I've gone wrong somewhere.<br>

><br>

> Let's take the small cluster with 140 GPUs and no fairshare as an example, because it's easier for me to explain.<br>

><br>

> The users, who all know each other personally and interact via chat, decide on a daily basis how many jobs each user can run at a time.<br>

><br>

> Let's say today is Sunday (hypothetically). Nobody is actively developing today, except that user 1 has 10 jobs running for the entire weekend. That leaves 130 GPUs unused.<br>

><br>

> User 2, whose jobs all run on 1 GPU, decides to run a regression test. The regression test comprises 9 different scripts, each run 40 times, for a grand total of 360 jobs. The scripts take between 1 and 5 hours to complete, and the jobs take on average 4 hours.<br>

><br>

> User 2 gets the user group's approval (via chat) to use 90 GPUs (so that 40 GPUs will remain for anyone else wanting to work that day).<br>

><br>

> The problem I'm trying to solve is this: how do I ensure that user 2 launches his 360 jobs in such a way that 90 jobs are in the run state consistently until the regression test is finished?<br>

><br>

> Keep in mind that:<br>

><br>

> limiting each job array to 10 jobs is inefficient: when the first job array finishes (long before the last one), only 80 GPUs will be used, and so on as other arrays finish<br>

> the admin is not available, so he cannot be asked to set a hard limit of 90 jobs for user 2 just for today<br>

><br>

> I would be happy to use job arrays if they allow me to set an overarching job limit across multiple arrays. Perhaps this is doable. Admittedly, I'm working on a paper to be submitted in a few days, so I don't have time to test job arrays right now, but I will try them out thoroughly once I've submitted my paper (i.e., after Sept 5).<br>

><br>

> My solution, for now, is to not use job arrays. Instead, I launch each job individually, and I use singleton (by launching all jobs with the same 90 unique names) to ensure that exactly 90 jobs are run at a time (in this case, corresponding to 90 GPUs in use).<br>
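A rough sketch of how such a submission could be generated, assuming job names are handed out round-robin. The function name singleton_commands, the "regr" prefix and the job script names are hypothetical; --dependency=singleton and --job-name are the sbatch options the scheme relies on.<br>

```python
def singleton_commands(job_scripts, max_running, prefix="regr"):
    """Build one sbatch command per job, cycling through max_running names.

    With --dependency=singleton, a job waits until earlier jobs with the
    same name and user have finished, so at most one job per name runs.
    """
    if max_running < 1:
        raise ValueError("max_running must be at least 1")
    commands = []
    for i, script in enumerate(job_scripts):
        name = f"{prefix}_{i % max_running}"
        commands.append(
            ["sbatch", "--dependency=singleton", f"--job-name={name}", script]
        )
    return commands


# Hypothetical job scripts standing in for the 360 regression jobs.
jobs = [f"job_{i}.sh" for i in range(360)]
commands = singleton_commands(jobs, max_running=90)
```

Each entry in commands can be passed to subprocess.run to submit the job. The list only describes the submissions, so it can be inspected before anything is sent to the scheduler.<br>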

><br>

> Side note: the unavailability of the admin might sound contrived by picking Sunday as an example, but it's in fact very typical. The admin is not available:<br>

><br>

> on weekends (the present example)<br>

> at any time outside of 9am to 5pm (keep in mind, this is a cluster used by students in different time zones)<br>

> any time he is on vacation<br>

> any time he is looking after his many other responsibilities. Constantly setting user limits that change on a daily basis would be too much to ask.<br>

><br>

><br>

> I'd be happy if you corrected my misunderstandings, especially if you could show me how to set a job limit that takes effect over multiple job arrays.<br>

><br>

> I may have very glaring oversights as I don't necessarily have a big picture view of things (I've never been an admin, most notably), so feel free to poke holes in the way I've constructed things.<br>

><br>

> Regards,<br>

> Guillaume.<br>

><br>

><br>

> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick <<a href="mailto:kg4ydw@gmail.com" target="_blank">kg4ydw@gmail.com</a>> wrote:<br>

>><br>

>> This makes no sense and seems backwards to me.<br>

>><br>

>> When you submit an array job, you can specify how many jobs from the<br>

>> array you want to run at once.<br>

>> So, an administrator can create a QOS that explicitly limits the user.<br>

>> However, you keep saying that they probably won't modify the system<br>

>> for just you...<br>

>><br>

>> That seems to me to be the perfect case to use array jobs and tell it<br>

>> how many elements of the array to run at once.<br>

>> You're not using array jobs for exactly the wrong reason.<br>

>><br>

>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault<br>

>> <<a href="mailto:gperr050@uottawa.ca" target="_blank">gperr050@uottawa.ca</a>> wrote:<br>

>> > The reason I don't use job arrays is to be able to limit the number of jobs per user<br>

>><br>

<br>

</blockquote></div>