Thanks Loris,

Am I correct that, reading between the lines, you're saying: rather than pursuing my "soft" limit idea, I should just use the regular hard limits, be generous with the default, and provide user education instead? In fact that is an alternative approach I am considering too.
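
If I go down that road, I imagine it would boil down to something like this in slurm.conf (the partition name, node list and times below are purely illustrative, nothing is decided yet):

  PartitionName=batch Nodes=node[01-64] Default=YES DefaultTime=2-00:00:00 MaxTime=14-00:00:00 State=UP

i.e. jobs that don't specify --time fall back to a generous default, and nothing may run past the hard MaxTime cap.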

On Wed, Jun 11, 2025 at 6:15 AM Loris Bennett via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi Davide,

Davide DelVento via slurm-users
<slurm-users@lists.schedmd.com> writes:

> In the institution where I work, so far we have managed to live
> without mandatory wallclock limits (a policy decided well before I
> joined the organization), and that has been possible because the
> cluster was not very much utilized.
>
> Now that is changing, with more jobs being submitted and those being
> larger ones. As such I would like to introduce wallclock limits to
> allow slurm to be more efficient in scheduling jobs, including with
> backfill.
>
> My concern is that this user base is not used to it and therefore I
> want to make it easier for them, and avoid common complaints. I
> anticipate one of them would be "my job was cancelled even though
> there were enough nodes idle and no other job in line after mine"
> (since the cluster utilization is increasing, but not yet always full
> like it has been at most other places I know).
>
> So my question is: is it possible to implement "soft" wallclock limits
> in slurm, namely ones which would not be enforced unless necessary to
> run more jobs? In other words, is it possible to change the
> pre-emptability of a job only after some time has passed? I can think
> of some ways to hack this functionality myself with some cron or at
> jobs, and that might be easy enough to do, but I am not sure I can
> make it robust enough to cover all situations, so I'm looking for
> something either slurm-native or (if external solution) field-tested
> by someone else already, so that at least the worst kinks have been
> already ironed out.
>
> Thanks in advance for any suggestions you may provide!

We just have a default wallclock limit of 14 days, but we also have QOS
with shorter wallclock limits and higher priorities, albeit for fewer
jobs and resources:

$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
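
If it helps, QOS along these lines can be set up with sacctmgr, roughly as follows (the name and limits here are only an illustration, not our exact commands):

  sacctmgr add qos hiprio
  sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
      MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 \
      MaxTRESPerUser=cpu=128,gres/gpu=4

and then granted to users through their associations, e.g.

  sacctmgr modify user someuser set qos+=hiprio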

We also have a page of documentation which explains how users can benefit
from backfill.  Thus users have a certain incentive to specify a shorter
wallclock limit if they can.
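
For example, a user whose job reliably finishes within a couple of hours can submit it along the lines of

  sbatch --qos=hiprio --time=02:00:00 job.sh

(job.sh is just a placeholder), and a short, accurate --time makes the job a good candidate for backfilling into scheduling gaps.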

'sqos' is just an alias for

  sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
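
e.g. defined in a shell startup file such as ~/.bashrc:

  alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'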

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT, Freie Universität Berlin

