[slurm-users] ticking time bomb? launching too many jobs in parallel

Steven Dick kg4ydw at gmail.com
Sat Aug 31 16:41:08 UTC 2019


Probably the ideal solution would be a mix of array jobs and QOS.
I'd at least use the array jobs within a single set of regressions,
with or without per-set run limits on the array.
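
Roughly, something like this (untested sketch; the QOS name
"regression", the limit values, and run_regression.sh are just
placeholders):

    # one-time setup: a QOS that caps how many jobs a user can run at once
    sacctmgr add qos regression
    sacctmgr modify qos regression set MaxJobsPerUser=20

    # one set of regressions as an array, throttled to 10 running
    # tasks at a time by the %10 suffix
    sbatch --array=0-99%10 --qos=regression run_regression.sh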

On Sat, Aug 31, 2019 at 11:17 AM Guillaume Perrault Archambault
<gperr050 at uottawa.ca> wrote:
>
> Hi Steven,
>
> Thanks for your help.
>
> Looks like QOS is the way to go if I want both job arrays + user limits on jobs/resources (in the context of a regression-test).
>
> Regards,
> Guillaume.
>
> On Fri, Aug 30, 2019 at 6:11 PM Steven Dick <kg4ydw at gmail.com> wrote:
>>
>> On Fri, Aug 30, 2019 at 2:58 PM Guillaume Perrault Archambault
>> <gperr050 at uottawa.ca> wrote:
>> > My problem with that, though, is: what if the scripts (the 9 scripts in my earlier example) each have different requirements? For example, running on a different partition, or setting a different time limit? My understanding is that for a single job array, every job gets the same requirements.
>>
>> That's a little messier and may be less suitable for an array job.
>> However, some of that can be accomplished. You can, for instance,
>> submit a job to multiple partitions and then use srun within the job
>> to allocate resources to individual tasks within the job.
>> But you get a lot less control over how the resources are spread, so
>> it might not be workable.
>>
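Roughly what that could look like, as an untested sketch (partition
names, script names, and counts are made up):

    #!/bin/bash
    # submit to multiple partitions; the job runs in whichever one
    # can start it first
    #SBATCH --partition=short,long
    #SBATCH --ntasks=4
    #SBATCH --time=02:00:00

    # carve the allocation up with srun job steps, one task each,
    # so the steps don't share CPUs
    srun --ntasks=1 --exclusive ./regression_a &
    srun --ntasks=1 --exclusive ./regression_b &
    srun --ntasks=1 --exclusive ./regression_c &
    srun --ntasks=1 --exclusive ./regression_d &
    wait
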
>> > The other problem is that, with the way I've implemented it, I can change the max jobs dynamically.
>>
>> Others have indicated in this thread that qos can be dynamically
>> changed; I don't recall trying that, but if you did, I think you'd do
>> it with scontrol.
>>
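
For the dynamic part, a hedged sketch (QOS name, limits, and job IDs
are placeholders): QOS limits themselves are usually edited with
sacctmgr rather than scontrol, while scontrol can adjust an
already-submitted job or an array's throttle in place:

    # raise or lower the per-user cap on an existing QOS
    sacctmgr modify qos regression set MaxJobsPerUser=5

    # change the %N throttle of a job array that is already queued
    scontrol update JobId=123456 ArrayTaskThrottle=5

    # or move a pending job onto a different QOS
    scontrol update JobId=123457 QOS=regression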


