[slurm-users] ticking time bomb? launching too many jobs in parallel

Guillaume Perrault Archambault gperr050 at uottawa.ca
Fri Aug 30 19:29:31 UTC 2019

Thank you, Paul. If the admin does agree to create various QOS job limits or
GPU limits (e.g. 5, 10, 15, 20, ...), then that could be a powerful solution.
It would allow me to use job arrays.

I would still prefer a user-side solution if possible, because I'd like my
scripts to be as cluster-agnostic as possible. Not having to ask the admin of
each cluster to create QOSes would make it easier to port these scripts
across clusters.

That said, it may well end up being the best solution.


On Fri, Aug 30, 2019 at 3:16 PM Paul Edmon <pedmon at cfa.harvard.edu> wrote:

> Yes, QoSes are dynamic.
> -Paul Edmon-
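If Paul's "yes" answers the question as asked, the QOS of a job that is still pending can be switched with scontrol update. A minimal sketch that only prints the commands, assuming the site lets users modify their own pending jobs and has granted them access to the target QOS; the QOS name gpu90 and the job IDs are hypothetical.

```python
"""Print scontrol commands that move pending jobs to another QOS.

Hypothetical: the QOS name "gpu90" and the job IDs are placeholders.
Assumes the user may modify their own pending jobs and has access to
the target QOS. Nothing is executed.
"""
import shlex


def change_qos_commands(job_ids, new_qos):
    """Return one `scontrol update` argv per job ID."""
    return [
        ["scontrol", "update", f"JobId={job_id}", f"QOS={new_qos}"]
        for job_id in job_ids
    ]


if __name__ == "__main__":
    # Pending IDs could come from: squeue -u $USER -t PENDING -h -o %i
    pending = [1001, 1002, 1003]  # hypothetical job IDs
    for argv in change_qos_commands(pending, "gpu90"):
        print(shlex.join(argv))
```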
> On 8/30/19 2:58 PM, Guillaume Perrault Archambault wrote:
> Hi Paul,
> Thanks for your pointers.
> I'll look into QOS and MCS after my paper deadline (Sept 5). Re QOS, as
> I said to Peter in the reply I just sent, I wonder whether the QOS of
> a job can be changed while it's pending (submitted but not yet running).
> Regards,
> Guillaume.
> On Fri, Aug 30, 2019 at 10:24 AM Paul Edmon <pedmon at cfa.harvard.edu>
> wrote:
>> A QoS is probably your best bet. Another variant might be MCS, which
>> you can use to help reduce resource fragmentation. For limits, though,
>> QoS will be your best bet.
>> -Paul Edmon-
>> On 8/30/19 7:33 AM, Steven Dick wrote:
>> > It would still be possible to use job arrays in this situation; it's
>> > just slightly messy.
>> > The way a job array works is that you submit a single script, and
>> > that script is given an integer (SLURM_ARRAY_TASK_ID) for each subjob.
>> > The integer is in a range, with an optional step (default=1).
>> >
>> > To run the situation you describe, you would have to predetermine how
>> > many of each test you want to run (i.e., you couldn't dynamically
>> > change the number of jobs that run within one array), and a master
>> > script would map the integer range to the job that was to be started.
>> >
>> > The most trivial way to do it would be to put the list of regressions
>> > in a text file and the master script would index it by line number and
>> > then run the appropriate command.
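A minimal sketch of the text-file approach, assuming a hypothetical regressions.txt with one command per non-blank line and an array submitted with 0-based indices (e.g. --array=0-359 for 360 lines), so that array index k runs the k-th command.

```python
"""Run the regression listed on line SLURM_ARRAY_TASK_ID of a text file.

Hypothetical file name: regressions.txt, one command per non-blank line.
Submit with 0-based indices, e.g. --array=0-359 for 360 lines.
"""
import os
import shlex
import subprocess
import sys


def command_for_task(lines, task_id):
    """Return the task_id-th (0-based) non-blank line, stripped."""
    commands = [line.strip() for line in lines if line.strip()]
    if not 0 <= task_id < len(commands):
        raise IndexError(f"task {task_id} has no matching line")
    return commands[task_id]


def main(path="regressions.txt"):
    task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    with open(path) as fh:
        command = command_for_task(fh, task_id)
    print(f"task {task_id}: {command}", flush=True)
    # shlex.split and no shell: pipes or redirections in the file won't work.
    return subprocess.run(shlex.split(command)).returncode


# Only act inside a Slurm array task, where SLURM_ARRAY_TASK_ID is set.
if __name__ == "__main__" and "SLURM_ARRAY_TASK_ID" in os.environ:
    sys.exit(main(*sys.argv[1:2]))
```

Skipping blank lines keeps the index-to-command mapping stable if the file picks up stray empty lines; the array range still has to match the number of commands.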
>> > A more complex way would be to do some math (a divide?) to get the
>> > script name and subindex (modulus?) for each regression.
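The divide/modulus variant can be sketched with divmod, assuming the 9 scripts x 40 runs suite from Guillaume's example is submitted as a single array over indices 0..359. The script names are hypothetical placeholders.

```python
"""Map a flat Slurm array index onto (script, repetition) with divmod.

Assumes 9 scripts x 40 runs submitted as one array with indices 0..359.
Script names are hypothetical.
"""
import os

SCRIPTS = [f"test_{i}.sh" for i in range(9)]  # hypothetical script names
RUNS_PER_SCRIPT = 40


def task_to_job(task_id, scripts=SCRIPTS, runs=RUNS_PER_SCRIPT):
    """Return (script name, repetition index) for a 0-based array task ID."""
    if not 0 <= task_id < len(scripts) * runs:
        raise ValueError(f"task_id {task_id} out of range")
    # Quotient selects the script, remainder selects the repetition.
    script_idx, rep = divmod(task_id, runs)
    return scripts[script_idx], rep


if __name__ == "__main__":
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    script, rep = task_to_job(task_id)
    print(f"task {task_id}: {script}, run {rep}")
```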
>> >
>> > Both of these would require some semi-advanced scripting, but nothing
>> > that couldn't be cut and pasted with some trivial modifications for
>> > each job set.
>> >
>> > As to the unavailability of the admin ...
>> > An alternate approach that would require the admin's help would be to
>> > come up with a small set of allocations (e.g., 40 GPUs, 80 GPUs, 100
>> > GPUs, etc.) and make a QOS for each one with a GPU limit (e.g.,
>> > MaxTRESPU=gres/gpu=40). Then the user would assign that QOS to the job
>> > when starting it, to set the overall allocation for all the jobs. The
>> > admin wouldn't need to tweak this more than once; you just pick which
>> > tier to use.
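A sketch of the tiered-QOS idea from the user's side: with a few tiers the admin creates once (hypothetical names gpu40, gpu80, gpu100, each capping a user's GPUs with a per-user TRES limit), the user picks the largest tier that stays within the share agreed on chat and submits every job with --qos. The sacctmgr lines in the comment illustrate the kind of one-time admin setup meant; check them against your site's Slurm version, and note that users would also need to be granted access to these QOSes.

```python
"""Choose one of a few admin-created GPU-cap QOS tiers for a submission.

Hypothetical tiers: the names gpu40/gpu80/gpu100 and their caps are
placeholders modelled on this thread. Nothing is executed.
"""
import shlex

# QOS name -> per-user GPU cap the admin would configure once, e.g.
#   sacctmgr add qos gpu40
#   sacctmgr modify qos gpu40 set MaxTRESPerUser=gres/gpu=40
QOS_TIERS = {"gpu40": 40, "gpu80": 80, "gpu100": 100}


def pick_qos(agreed_gpus, tiers=QOS_TIERS):
    """Return the largest tier whose cap does not exceed the agreed GPUs."""
    fitting = [(cap, name) for name, cap in tiers.items() if cap <= agreed_gpus]
    if not fitting:
        raise ValueError(f"no QOS tier fits within {agreed_gpus} GPUs")
    return max(fitting)[1]


def sbatch_argv(script, qos):
    """Build an sbatch command line that submits `script` under `qos`."""
    return ["sbatch", f"--qos={qos}", "--gres=gpu:1", script]


if __name__ == "__main__":
    qos = pick_qos(90)  # the 90-GPU share agreed on chat in this thread
    print(shlex.join(sbatch_argv("regression.sh", qos)))  # hypothetical script
```

Rounding down keeps user 2 inside the agreed 90 GPUs, at the cost of leaving 10 of them idle with these tiers; a finer set of tiers (such as the 5, 10, 15, 20, ... steps suggested at the top of this thread) narrows that gap.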
>> >
>> > On Fri, Aug 30, 2019 at 2:36 AM Guillaume Perrault Archambault
>> > <gperr050 at uottawa.ca> wrote:
>> >> Hi Steven,
>> >>
>> >> Thanks for taking the time to reply to my post.
>> >>
>> >> Setting a limit on the number of jobs for a single array isn't
>> >> sufficient, because regression tests need to launch multiple arrays and
>> >> I would need a job limit that takes effect over all launched jobs.
>> >>
>> >> It's very possible I'm misunderstanding something. I'll lay out a very
>> >> specific example in the hope that you can correct me if I've gone wrong
>> >> somewhere.
>> >>
>> >> Let's take the small cluster with 140 GPUs and no fairshare as an
>> >> example, because it's easier for me to explain.
>> >>
>> >> The users, who all know each other personally and interact via chat,
>> >> decide on a daily basis how many jobs each user can run at a time.
>> >>
>> >> Let's say today is Sunday (hypothetically). Nobody is actively
>> >> developing today, except that user 1 has 10 jobs running for the entire
>> >> weekend. That leaves 130 GPUs unused.
>> >>
>> >> User 2, whose jobs each run on 1 GPU, decides to run a regression test.
>> >> The regression test comprises 9 different scripts, each run 40 times,
>> >> for a grand total of 360 jobs. The scripts take between 1 and 5 hours
>> >> to complete, and a job takes 4 hours on average.
>> >>
>> >> User 2 gets the user group's approval (via chat) to use 90 GPUs (so
>> >> that 40 GPUs will remain for anyone else wanting to work that day).
>> >>
>> >> The problem I'm trying to solve is this: how do I ensure that user 2
>> >> launches his 360 jobs in such a way that 90 jobs are consistently in
>> >> the running state until the regression test is finished?
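The arithmetic behind wanting 90 GPUs busy the whole time, using the numbers in this example: 360 jobs at 4 hours on average is 1440 GPU-hours, so even a perfectly packed run on 90 GPUs needs at least 1440 / 90 = 16 hours, and no schedule can finish before its longest (5-hour) job. A small helper for that lower bound:

```python
"""Back-of-the-envelope wall-clock bound for the regression run above.

Numbers from the example: 360 jobs, 4 h average, 5 h longest, 1 GPU per job.
"""


def makespan_lower_bound(n_jobs, mean_hours, max_hours, gpus):
    """No schedule beats total GPU-hours / GPUs, nor the longest single job."""
    gpu_hours = n_jobs * mean_hours
    return max(gpu_hours / gpus, max_hours)


if __name__ == "__main__":
    for gpus in (90, 80):
        hours = makespan_lower_bound(360, 4, 5, gpus)
        print(f"{gpus} GPUs busy -> at least {hours:.1f} h")
```

At a constant 80 GPUs the same bound rises to 18 hours, which is the kind of slowdown that idle GPUs from finished arrays add.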
>> >>
>> >> Keep in mind that:
>> >>
>> >> - limiting each job array to 10 jobs is inefficient: when the first job
>> >>   array finishes (long before the last one), only 80 GPUs will be used,
>> >>   and so on as other arrays finish;
>> >> - the admin is not available, so he cannot be asked to set a hard limit
>> >>   of 90 jobs for user 2 just for today.
>> >>
>> >> I would be happy to use job arrays if they allow me to set an
>> >> overarching job limit across multiple arrays. Perhaps this is doable.
>> >> Admittedly, I'm working on a paper to be submitted in a few days, so I
>> >> don't have time to test job arrays now, but I will try them out
>> >> thoroughly once I've submitted my paper (i.e. after Sept 5).
>> >>
>> >> My solution, for now, is to not use job arrays. Instead, I launch each
>> >> job individually and use the singleton dependency (spreading all jobs
>> >> over the same 90 unique job names) to ensure that at most 90 jobs run
>> >> at a time (in this case, corresponding to 90 GPUs in use).
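A sketch of the singleton workaround: every job gets one of 90 job names, and under --dependency=singleton jobs sharing a name (and user) run one after another, so no more than 90 run at once. Round-robin name assignment, the "regr" prefix and the test commands are assumptions for illustration; the script prints the sbatch command lines instead of submitting them.

```python
"""Cap concurrency at N using --dependency=singleton and N shared job names.

Round-robin name assignment, the "regr" prefix and the test commands are
assumptions. Commands are printed, not submitted.
"""
import shlex


def singleton_submissions(commands, n_slots, prefix="regr"):
    """Return one sbatch argv per command, cycling through n_slots job names.

    Jobs that share a name (and user) run one after another under
    --dependency=singleton, so at most n_slots of them run concurrently.
    """
    submissions = []
    for i, command in enumerate(commands):
        submissions.append([
            "sbatch",
            f"--job-name={prefix}{i % n_slots}",
            "--dependency=singleton",
            "--gres=gpu:1",
            f"--wrap={command}",
        ])
    return submissions


if __name__ == "__main__":
    # Hypothetical suite: 9 scripts x 40 runs = 360 jobs over 90 names.
    tests = [f"./test_{s}.sh {run}" for s in range(9) for run in range(40)]
    subs = singleton_submissions(tests, 90)
    print(len(subs), "submissions; first two:")
    for argv in subs[:2]:
        print(shlex.join(argv))
```

One consequence of this design: each name forms a fixed chain of four jobs decided at submission time, so a chain that happens to hold several 5-hour jobs keeps running after the others have drained, and the tail of the run uses fewer than 90 GPUs.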
>> >>
>> >> Side note: picking Sunday as an example might make the admin's
>> >> unavailability sound contrived, but it is in fact very typical. The
>> >> admin is not available:
>> >>
>> >> - on weekends (the present example);
>> >> - at any time outside of 9am to 5pm (keep in mind, this is a cluster
>> >>   used by students in different time zones);
>> >> - any time he is on vacation;
>> >> - any time he is looking after his many other responsibilities.
>> >>
>> >> Constantly setting user limits that change on a daily basis would be
>> >> too much to ask.
>> >>
>> >>
>> >> I'd be happy if you corrected my misunderstandings, especially if you
>> >> could show me how to set a job limit that takes effect over multiple
>> >> job arrays.
>> >>
>> >> I may have some glaring oversights, as I don't necessarily have a
>> >> big-picture view of things (most notably, I've never been an admin), so
>> >> feel free to poke holes in the way I've set things up.
>> >>
>> >> Regards,
>> >> Guillaume.
>> >>
>> >>
>> >> On Fri, Aug 30, 2019 at 1:22 AM Steven Dick <kg4ydw at gmail.com> wrote:
>> >>> This makes no sense and seems backwards to me.
>> >>>
>> >>> When you submit an array job, you can specify how many jobs from the
>> >>> array you want to run at once.
>> >>> So, an administrator can create a QOS that explicitly limits the user.
>> >>> However, you keep saying that they probably won't modify the system
>> >>> for just you...
>> >>>
>> >>> That seems to me to be the perfect case to use array jobs and tell it
>> >>> how many elements of the array to run at once.
>> >>> You're not using array jobs for exactly the wrong reason.
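Applied to Guillaume's example, this suggestion means putting all 9 x 40 tests into a single array: the %90 suffix on --array is then one cap over all 360 tasks, so when one script's runs finish, tasks from the other scripts fill the freed GPUs. The caveat noted earlier in the thread still holds: the task count is fixed at submission. A sketch that only prints the command, assuming a hypothetical wrapper script regression_array.sh that maps SLURM_ARRAY_TASK_ID to a test:

```python
"""Submit the whole regression suite as ONE job array with a %N throttle.

Assumes 9 scripts x 40 runs and a hypothetical wrapper script,
regression_array.sh, that maps SLURM_ARRAY_TASK_ID to a test.
The command is printed, not executed.
"""
import shlex


def throttled_array_argv(n_scripts, runs_per_script, max_running, wrapper):
    """sbatch argv for tasks 0..N-1 with at most `max_running` running at once."""
    last_index = n_scripts * runs_per_script - 1
    return [
        "sbatch",
        f"--array=0-{last_index}%{max_running}",
        "--gres=gpu:1",
        wrapper,
    ]


if __name__ == "__main__":
    print(shlex.join(throttled_array_argv(9, 40, 90, "regression_array.sh")))
```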
>> >>>
>> >>> On Tue, Aug 27, 2019 at 1:19 PM Guillaume Perrault Archambault
>> >>> <gperr050 at uottawa.ca> wrote:
>> >>>> The reason I don't use job arrays is to be able to limit the number of
>> >>>> jobs per user