I think the idea of having a generous default timelimit is the
wrong way to go. In fact, I think any defaults for jobs are a bad
way to go. The majority of your users will just use that default
time limit, and backfill scheduling will remain useless to you.
Instead, I recommend you use your job_submit.lua to reject all
jobs that don't have a wallclock time and print out a helpful
error message to inform users they now need to specify a wallclock
time, and provide a link to documentation on how to do that.
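A minimal sketch of what such a check could look like in
job_submit.lua, assuming the job_submit/lua plugin is enabled and that
an unset time limit shows up as slurm.NO_VAL. The message wording and
the documentation URL are placeholders, not anything from a real site:

```lua
-- job_submit.lua (sketch): reject jobs submitted without a time limit.
-- DOC_URL is a hypothetical placeholder; point it at your own docs.
local DOC_URL = "https://example.com/docs/time-limits"

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- time_limit is expressed in minutes; slurm.NO_VAL means the user
    -- did not pass --time (assumption: no other plugin set it earlier).
    if job_desc.time_limit == nil or job_desc.time_limit == slurm.NO_VAL then
        slurm.log_user("Job rejected: no time limit given. " ..
                       "Please resubmit with --time=<limit>. See " .. DOC_URL)
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

-- The plugin expects this function to exist as well; this sketch
-- leaves job modifications untouched.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

With something like this in place, a plain "sbatch job.sh" would be
refused with the message above, while "sbatch --time=02:00:00 job.sh"
would go through.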
Requiring users to specify a time limit themselves does two
things:
1. It reminds them that it's important to be conscious of time limits when submitting jobs.
2. If a job is killed before it's done and all the progress is
lost because the job wasn't checkpointing, they can't blame you as
the admin.
If you do this, it's easy to get the users on board by first
providing useful and usable documentation on why timelimits are
needed and how to set them. Be sure to hammer home the point that
effective timelimits can lead to their jobs running sooner, and
that effective timelimits can increase cluster
efficiency/utilization, helping them get a better return on their
investment (if they contribute to the cluster's cost) or get more
science done. I like to frame it that accurate wallclock
times will give them a competitive edge in getting their jobs
running before other cluster users. Everyone likes to think what
they're doing will give them an advantage!
My 4 cents (adjusted for inflation).
Prentice
Sounds good, thanks for confirming it. Let me sleep on it regarding
the "too many" QOS, or think about whether I should ditch this idea.
If I implement it, I'll post the details of how I did it in this
conversation.
Cheers
On Thu, Jun 12, 2025 at 6:59 AM Ansgar Esztermann-Kirchner <aeszter@mpinat.mpg.de> wrote:
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems
> that PreemptExemptTime is QOS-based and not job-based though. Is that
> correct? Or could it be set per-job, perhaps on a prolog/submit lua script?
Yes, that's correct.
I guess you could create a bunch of QOS with different
PreemptExemptTimes and then let the user select one (or indeed select
it from lua) but as far as I know, there is no way to set arbitrary
per-job values.
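A rough sketch of the "select it from lua" variant, purely as an
illustration: the QOS names and the thresholds below are hypothetical,
each QOS would have to be created beforehand with its own
PreemptExemptTime, and the users' associations would need access to
them. Choosing the QOS from the requested time limit is just one
possible rule, assumed here for the example:

```lua
-- job_submit.lua (sketch): pick a QOS tier when the user did not choose one.
-- Hypothetical tiers, ordered by ascending limit (minutes).
local QOS_TIERS = {
    { max_minutes = 60,   qos = "exempt_short" },
    { max_minutes = 1440, qos = "exempt_long"  },
}

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Respect an explicit --qos from the user.
    if job_desc.qos ~= nil then
        return slurm.SUCCESS
    end
    local limit = job_desc.time_limit
    -- Without a time limit there is nothing to base the choice on.
    if limit == nil or limit == slurm.NO_VAL then
        return slurm.SUCCESS
    end
    for _, tier in ipairs(QOS_TIERS) do
        if limit <= tier.max_minutes then
            job_desc.qos = tier.qos
            break
        end
    end
    -- Jobs longer than every tier keep the default QOS.
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

This still only offers a fixed set of values, one per QOS, rather than
arbitrary per-job settings.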
Best,
A.
--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
https://www.mpinat.mpg.de/person/11315/3883774
--
Prentice Bisbal
HPC Systems Engineer III
Computational & Information Systems Laboratory (CISL)
NSF National Center for Atmospheric Research (NSF NCAR)
https://www.cisl.ucar.edu
https://ncar.ucar.edu