Hi Davide,
Davide DelVento via slurm-users slurm-users@lists.schedmd.com writes:
> In the institution where I work, so far we have managed to live without mandatory wallclock limits (a policy decided well before I joined the organization), and that has been possible because the cluster was not very much utilized.
> Now that is changing, with more jobs being submitted and those being larger ones. As such I would like to introduce wallclock limits to allow slurm to be more efficient in scheduling jobs, including with backfill.
> My concern is that this user base is not used to it and therefore I want to make it easier for them, and avoid common complaints. I anticipate one of them would be "my job was cancelled even though there were enough nodes idle and no other job in line after mine" (since the cluster utilization is increasing, but not yet always full like it has been at most other places I know).
> So my question is: is it possible to implement "soft" wallclock limits in slurm, namely ones which would not be enforced unless necessary to run more jobs? In other words, is it possible to change the pre-emptability of a job only after some time has passed? I can think of some ways to hack this functionality myself with some cron or at jobs, and that might be easy enough to do, but I am not sure I can make it robust enough to cover all situations, so I'm looking for something either slurm-native or (if external solution) field-tested by someone else already, so that at least the worst kinks have been already ironed out.
> Thanks in advance for any suggestions you may provide!
We just have a default wallclock limit of 14 days, but we also have QOSs with shorter wallclock limits and higher priorities, albeit for fewer jobs and resources:
$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
We also have a page of documentation which explains how users can profit from backfill. Thus users have a certain incentive to specify a shorter wallclock limit, if they can.
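As a rough sketch of what that looks like from the user side, a job expected to finish within the 3-hour limit of 'hiprio' could use a batch script along these lines. The 2.5-hour request, the single task, and the payload are made up for illustration; only the QOS name and its MaxWall come from the listing above.

```shell
#!/bin/bash
# Request the high-priority QOS from the listing above.
#SBATCH --qos=hiprio
# Ask for 2.5 hours, which is below the MaxWall of hiprio (03:00:00).
# A shorter request also gives backfill a better chance to slot the job in.
#SBATCH --time=02:30:00
# Assumption for illustration: a single task.
#SBATCH --ntasks=1

# Placeholder payload; replace with the real work.
# SLURM_JOB_ID is set by Slurm inside a job; fall back to "none" elsewhere.
msg="job ${SLURM_JOB_ID:-none} started"
echo "$msg"
```

This would be submitted with something like 'sbatch job.sh' (the file name is hypothetical). If the job then runs past the requested time, Slurm cancels it, so users should leave themselves some margin.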
'sqos' is just an alias for
sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
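If you want the same shortcut, a line like the following in a shell startup file should do. Putting it in ~/.bashrc is my assumption here; any file your shell reads at startup works.

```shell
# Define 'sqos' as a shortcut for the sacctmgr query shown above.
# The %20 suffix widens the MaxTRESPU column to 20 characters.
alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'
```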
Cheers,
Loris