Thanks Loris,

Am I correct that, reading between the lines, you're saying: rather than pursuing my "soft" limit idea, I should just use the regular hard limits, be generous with the default, and provide user education instead? In fact that is an alternative approach I am considering too.
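
If I go down that road, I imagine it would boil down to something like this in slurm.conf (the partition name, node list and times below are purely illustrative, nothing is decided yet):

  PartitionName=batch Nodes=node[01-64] Default=YES DefaultTime=2-00:00:00 MaxTime=14-00:00:00 State=UP

i.e. jobs that don't specify --time fall back to a generous default, and nothing may run past the hard MaxTime cap.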

On Wed, Jun 11, 2025 at 6:15 AM Loris Bennett via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi Davide,

Davide DelVento via slurm-users
<slurm-users@lists.schedmd.com> writes:

> In the institution where I work, so far we have managed to live
> without mandatory wallclock limits (a policy decided well before I
> joined the organization), and that has been possible because the
> cluster was not very much utilized.
>
> Now that is changing, with more jobs being submitted and those being
> larger ones. As such I would like to introduce wallclock limits to
> allow slurm to be more efficient in scheduling jobs, including with
> backfill.
>
> My concern is that this user base is not used to it and therefore I
> want to make it easier for them, and avoid common complaints. I
> anticipate one of them would be "my job was cancelled even though
> there were enough nodes idle and no other job in line after mine"
> (since the cluster utilization is increasing, but not yet always full
> like it has been at most other places I know).
>
> So my question is: is it possible to implement "soft" wallclock limits
> in slurm, namely ones which would not be enforced unless necessary to
> run more jobs? In other words, is it possible to change the
> pre-emptability of a job only after some time has passed? I can think
> of some ways to hack this functionality myself with some cron or at
> jobs, and that might be easy enough to do, but I am not sure I can
> make it robust enough to cover all situations, so I'm looking for
> something either slurm-native or (if external solution) field-tested
> by someone else already, so that at least the worst kinks have been
> already ironed out.
>
> Thanks in advance for any suggestions you may provide!

We just have a default wallclock limit of 14 days, but we also have QOS
with shorter wallclock limits and higher priorities, albeit for fewer
jobs and resources:

$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
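
If it helps, QOS along these lines can be set up with sacctmgr, roughly as follows (the name and limits here are only an illustration, not our exact commands):

  sacctmgr add qos hiprio
  sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
      MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 \
      MaxTRESPerUser=cpu=128,gres/gpu=4

and then granted to users through their associations, e.g.

  sacctmgr modify user someuser set qos+=hiprio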

We also have a page of documentation which explains how users can benefit
from backfill.  Thus users have a certain incentive to specify a shorter
wallclock limit if they can.
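
For example, a user whose job reliably finishes within a couple of hours can submit it along the lines of

  sbatch --qos=hiprio --time=02:00:00 job.sh

(job.sh is just a placeholder), and a short, accurate --time makes the job a good candidate for backfilling into scheduling gaps.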

'sqos' is just an alias for

  sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
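
e.g. defined in a shell startup file such as ~/.bashrc:

  alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'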

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

