In the institution where I work, we have so far managed to live without mandatory wallclock limits (a policy decided well before I joined the organization), and that has been possible because the cluster was not heavily utilized.
Now that is changing, with more and larger jobs being submitted. I would therefore like to introduce wallclock limits so that Slurm can schedule jobs more efficiently, including with backfill.
My concern is that this user base is not used to such limits, so I want to make the transition as easy as possible for them and avoid common complaints. I anticipate one of them would be "my job was cancelled even though there were enough idle nodes and no other job in line after mine" (cluster utilization is increasing, but it is not yet always full, as it has been at most other places I know).
So my question is: is it possible to implement "soft" wallclock limits in Slurm, namely ones which would not be enforced unless necessary to run more jobs? In other words, is it possible to change the preemptability of a job only after some time has passed? I can think of some ways to hack this functionality myself with cron or at jobs, and that might be easy enough to do, but I am not sure I could make it robust enough to cover all situations. So I am looking for something either Slurm-native or, if an external solution, field-tested by someone else, so that at least the worst kinks have already been ironed out.
Thanks in advance for any suggestions you may provide!
Hi Davide,
We just have a default wallclock limit of 14 days, but we also have QOSs with shorter wallclock limits and higher priorities, albeit for fewer jobs and resources:
$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
We also have a documentation page which explains how users can benefit from backfill, so users have a certain incentive to specify a shorter wallclock limit if they can.
'sqos' is just an alias for
sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
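For illustration, QOSs along those lines could be created with something like the following sacctmgr commands. This is only a sketch: it assumes the limits shown above are per-user (MaxJobsPerUser and friends), 'alice' is just a placeholder user, and for the QOS priority to have any effect PriorityWeightQOS needs to be non-zero in slurm.conf:

sacctmgr add qos name=hiprio   priority=100000 maxwall=03:00:00    maxjobsperuser=50   maxsubmitjobsperuser=100   maxtresperuser=cpu=128,gres/gpu=4
sacctmgr add qos name=prio     priority=1000   maxwall=3-00:00:00  maxjobsperuser=500  maxsubmitjobsperuser=1000  maxtresperuser=cpu=256,gres/gpu=8
sacctmgr add qos name=standard priority=0      maxwall=14-00:00:00 maxjobsperuser=2000 maxsubmitjobsperuser=10000 maxtresperuser=cpu=768,gres/gpu=16
sacctmgr modify user name=alice set qos+=hiprio,prio,standard defaultqos=standard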
Cheers,
Loris
Thanks Loris,
Am I right that, reading between the lines, you are saying: rather than going ahead with my "soft" limit idea, just use regular hard limits, be generous with the default, and provide user education instead? In fact, that is an alternative approach I am considering too.
Hi Davide,
Yes. In fact, we never get anyone complaining about jobs being cancelled for having reached their time limit even when other resources were idle. Having said that, we do occasionally extend the time limit for individual jobs when requested. We also don't preempt any jobs.
Apart from that, I imagine implementing your 'soft' limits robustly might be quite challenging and/or time-consuming, as I am not aware that Slurm has anything like that built in.
Cheers,
Loris
Hi Davide,
I think it should be possible to emulate this via preemption: if you set PreemptMode to CANCEL, a preempted job will behave just as if it reached the end of its wall time. Then, you can use PreemptExemptTime as your soft wall time limit -- the job will not be preempted before PreemptExemptTime has passed.
See https://slurm.schedmd.com/preempt.html
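A minimal slurm.conf sketch of what I mean (the preemption type and the two-hour value are just illustrative assumptions; PreemptExemptTime can also be set on individual QOSs, which takes precedence over the global value):

PreemptType=preempt/qos      # or preempt/partition_prio; the actual preemption rules live in the QOS/partition definitions
PreemptMode=CANCEL           # a preempted job is cancelled, as if it had reached its wall time
PreemptExemptTime=02:00:00   # the "soft" limit: no job is preempted during its first 2 hours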
Best,
A.
Hi Ansgar,
This is indeed what I was looking for: I was not aware of PreemptExemptTime.
From my cursory glance at the documentation, though, it seems that PreemptExemptTime is QOS-based rather than job-based. Is that correct? Or could it be set per job, perhaps in a prolog or job_submit Lua script? I am thinking that users could keep using the regular wallclock limit setting in Slurm, and the script could strip that and use it to set PreemptExemptTime instead.
Thanks, Davide
Yes, that's correct. I guess you could create a bunch of QOSs with different PreemptExemptTime values and then let the user select one (or indeed select it from Lua), but as far as I know, there is no way to set arbitrary per-job values.
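For illustration, the "select it from Lua" part could look roughly like this in job_submit.lua (the QOS names and time thresholds are invented, and the corresponding QOSs with matching PreemptExemptTime values would have to exist already):

-- Map the requested time limit to one of a few pre-created QOSs
-- whose PreemptExemptTime roughly matches it (names are invented).
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- time_limit is in minutes; nil / NO_VAL means the user did not set one
    if job_desc.time_limit ~= nil and job_desc.time_limit ~= slurm.NO_VAL then
        if job_desc.time_limit <= 180 then        -- up to 3 hours
            job_desc.qos = "soft_3h"
        elseif job_desc.time_limit <= 4320 then   -- up to 3 days
            job_desc.qos = "soft_3d"
        else
            job_desc.qos = "soft_14d"
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end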
Best,
A.
Sounds good, thanks for confirming it. Let me sleep on the "too many QOSs" issue, or decide whether I should ditch this idea altogether. If I implement it, I'll post the details of how I did it in this conversation. Cheers