I think a generous default time limit is the wrong way to go. In fact, I think any defaults for jobs are a bad way to go. The majority of your users will just use that default time limit, and because the backfill scheduler relies on each job's time limit to find gaps it can fill, backfill scheduling will remain useless to you.

Instead, I recommend you use your job_submit.lua to reject all jobs that don't specify a wall-clock time limit, and print a helpful error message telling users that they now need to specify one, with a link to documentation on how to do that.
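
A minimal sketch of what that job_submit.lua could look like. It is untested and rests on a few assumptions: that an unset time limit shows up as slurm.NO_VAL in job_desc.time_limit on your Slurm version, and that DOCS_URL (a made-up placeholder) is replaced with your own documentation page.

```lua
-- job_submit.lua: reject jobs submitted without an explicit time limit.
-- Sketch only. DOCS_URL is a hypothetical placeholder, not a real page.
local DOCS_URL = "https://hpc.example.org/docs/time-limits"

function slurm_job_submit(job_desc, part_list, submit_uid)
    -- Assumption: a job submitted without --time carries slurm.NO_VAL here.
    -- Verify this on your Slurm version before deploying.
    if job_desc.time_limit == nil or job_desc.time_limit == slurm.NO_VAL then
        local msg = string.format(
            "Job rejected: no time limit specified. " ..
            "Resubmit with --time=<limit>, e.g. --time=02:00:00. See %s",
            DOCS_URL)
        -- log_user sends the message back to the submitting user.
        slurm.log_user("%s", msg)
        return slurm.ERROR
    end
    return slurm.SUCCESS
end

-- The plugin must also define slurm_job_modify; this one allows all changes.
function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```

This only takes effect with JobSubmitPlugins=lua set in slurm.conf and the script placed alongside it. Try it on a test cluster first, since a mistake in this script can block every submission.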

Requiring users to specify a time limit themselves does two things:

1. It reminds them that it's important to be conscious of time limits when submitting jobs.

2. If a job is killed before it's done and all the progress is lost because the job wasn't checkpointing, they can't blame you as the admin.

If you do this, it's easy to get the users on board by first providing useful and usable documentation on why time limits are needed and how to set them. Be sure to hammer home the point that accurate time limits can lead to their jobs running sooner, and that they can increase cluster efficiency/utilization, which gives users a better return on their investment (if they contribute to the cluster's cost) or simply lets them get more science done. I like to frame it that accurate wall-clock times will give them a competitive edge in getting their jobs running before other cluster users. Everyone likes to think what they're doing will give them an advantage!

My 4 cents (adjusted for inflation).

Prentice

On 6/12/25 9:11 PM, Davide DelVento via slurm-users wrote:
Sounds good, thanks for confirming it. 
Let me sleep on it wrt the "too many" QOS, or think if I should ditch this idea.
If I implement it, I'll post the details of how I did it in this conversation.
Cheers

On Thu, Jun 12, 2025 at 6:59 AM Ansgar Esztermann-Kirchner <aeszter@mpinat.mpg.de> wrote:
On Thu, Jun 12, 2025 at 04:52:24AM -0600, Davide DelVento wrote:
> Hi Ansgar,
>
> This is indeed what I was looking for: I was not aware of PreemptExemptTime.
>
> From my cursory glance at the documentation, it seems
> that PreemptExemptTime is QOS-based and not job based though. Is that
> correct? Or could it be set per-job, perhaps on a prolog/submit lua script?

Yes, that's correct.
I guess you could create a bunch of QOS with different
PreemptExemptTime values and then let the user select one (or indeed select
it from lua) but as far as I know, there is no way to set arbitrary
per-job values.

Best,

A.
--
Ansgar Esztermann
Sysadmin Dep. Theoretical and Computational Biophysics
https://www.mpinat.mpg.de/person/11315/3883774


    
-- 
Prentice Bisbal
HPC Systems Engineer III
Computational & Information Systems Laboratory (CISL)
NSF National Center for Atmospheric Research (NSF NCAR)
https://www.cisl.ucar.edu
https://ncar.ucar.edu