[slurm-users] Requirement to run longer jobs

Andy Georges andy.georges at ugent.be
Wed Jul 3 18:21:18 UTC 2019


On Wed, Jul 03, 2019 at 03:49:44PM +0000, David Baker wrote:
> Hello,
> A few of our users have asked about running longer jobs on our cluster. Currently our main/default compute partition has a time limit of 2.5 days. Potentially, a handful of users need jobs to run up to 5 days. Rather than allow all users/jobs to have a run time limit of 5 days, I wondered if the following scheme makes sense...

We have a similar issue: our default max walltime is 3 days, but because
checkpointing is not working properly at the moment, several of our
high-end users are asking for longer times.

> Increase the max run time on the default partition to 5 days, but limit most users to a max of 2.5 days using the default "normal" QOS.
> Create a QOS called "long" with a max time limit of 5 days. Limit the users who can use "long". For authorized users, assign the "long" QOS to their jobs on the basis of the run time request.
> Does the above make sense or is it too complicated? If the above works, could users limited to the normal QOS have their running jobs' run time increased to 5 days in exceptional circumstances?
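
The proposed scheme combines a partition-wide maximum of 5 days with a tighter per-QOS cap. The Python sketch below models only the decision logic, under stated assumptions: the user name "alice" and the helper names are hypothetical, and the automatic QOS choice stands in for whatever a submit hook (for example a job_submit/lua plugin) would do. In real Slurm the limits would live in slurm.conf (partition MaxTime) and in the QOS MaxWall set via sacctmgr, and QOS limits are typically only enforced when AccountingStorageEnforce includes limits and qos.

```python
from datetime import timedelta
from typing import Tuple

# Assumed limits mirroring the proposal: the default partition allows up to
# 5 days, the "normal" QOS caps jobs at 2.5 days, the "long" QOS at 5 days.
PARTITION_MAX_TIME = timedelta(days=5)
QOS_MAX_WALL = {
    "normal": timedelta(days=2.5),
    "long": timedelta(days=5),
}

# Hypothetical set of users who have been authorized to use the "long" QOS.
LONG_QOS_USERS = {"alice"}


def pick_qos(user: str, requested: timedelta) -> str:
    """Pick a QOS from the requested run time (what a submit hook might do)."""
    if requested > QOS_MAX_WALL["normal"] and user in LONG_QOS_USERS:
        return "long"
    return "normal"


def check_job(user: str, requested: timedelta) -> Tuple[str, bool]:
    """Return the chosen QOS and whether the request fits every applicable limit."""
    qos = pick_qos(user, requested)
    # Assumes no QOS override flag: the job must satisfy both the partition
    # limit and the QOS limit, so the effective cap is the smaller of the two.
    limit = min(PARTITION_MAX_TIME, QOS_MAX_WALL[qos])
    return qos, requested <= limit


if __name__ == "__main__":
    print(check_job("alice", timedelta(days=4)))  # authorized user, long job
    print(check_job("bob", timedelta(days=4)))    # unauthorized user, too long
```

The point of taking the minimum is that raising the partition limit alone does not open longer jobs to everyone; the "normal" QOS still holds most users to 2.5 days.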

Be aware that without restrictions, your users _will_ learn to take
advantage of the longer allowed walltime :)

We created a second partition that fully overlaps the default partition,
but with double the max wall time. Access to this partition is only
granted upon (motivated) request.
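
The overlapping-partition approach can be modelled the same way. This is a minimal Python sketch, not an actual configuration: the partition names "batch" and "batch_long", the group "longjobs", and the node names are all hypothetical. In slurm.conf the equivalent would be two PartitionName lines listing the same Nodes=, with the long one given a larger MaxTime and an access restriction such as AllowGroups or AllowAccounts.

```python
from datetime import timedelta
from typing import Dict, List, Optional, Set

# Hypothetical node list; both partitions use the same nodes (full overlap).
NODES = ["node01", "node02", "node03"]

# Assumed layout: the default 3-day partition and an overlapping partition
# with double the walltime, restricted to a hypothetical "longjobs" group.
PARTITIONS: Dict[str, dict] = {
    "batch": {"nodes": NODES, "max_time": timedelta(days=3), "allow_groups": None},
    "batch_long": {"nodes": NODES, "max_time": timedelta(days=6), "allow_groups": {"longjobs"}},
}


def eligible_partitions(user_groups: Set[str], requested: timedelta) -> List[str]:
    """List the partitions that would accept a job of the requested length."""
    eligible = []
    for name, part in PARTITIONS.items():
        allowed: Optional[Set[str]] = part["allow_groups"]
        # None means open to everyone; otherwise the user needs a matching group.
        if allowed is not None and not (allowed & user_groups):
            continue
        if requested <= part["max_time"]:
            eligible.append(name)
    return eligible


if __name__ == "__main__":
    print(eligible_partitions({"users"}, timedelta(days=5)))
    print(eligible_partitions({"users", "longjobs"}, timedelta(days=5)))
```

Because both partitions cover the same nodes, granted users' long jobs compete for the same hardware as everyone else's; only the walltime ceiling and the access list differ.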

I have no idea whether this makes more or less sense than your proposal;
to me it's a different way to accomplish pretty much the same goal :)

Kind regards,
-- Andy
