In the institution where I work, we have so far managed to live without mandatory wallclock limits (a policy decided well before I joined the organization), and that has been possible because the cluster was not heavily utilized.
Now that is changing, with more and larger jobs being submitted. I would therefore like to introduce wallclock limits so that Slurm can schedule jobs more efficiently, including with backfill.
My concern is that this user base is not used to such limits, so I want to make the transition as easy as possible for them and avoid common complaints. I anticipate one of them would be "my job was cancelled even though there were enough idle nodes and no other job in line after mine" (cluster utilization is increasing, but it is not yet always full, as it has been at most other places I know).
So my question is: is it possible to implement "soft" wallclock limits in Slurm, namely ones which would not be enforced unless necessary to run more jobs? In other words, is it possible to change the preemptability of a job only after some time has passed? I can think of some ways to hack this functionality myself with cron or at jobs, and that might be easy enough to do, but I am not sure I could make it robust enough to cover all situations. So I am looking for something either Slurm-native or, if an external solution, field-tested by someone else, so that at least the worst kinks have already been ironed out.
Thanks in advance for any suggestions you may provide!
Hi Davide,
We just have a default wallclock limit of 14 days, but we also have QOSs with shorter wallclock limits and higher priorities, albeit for fewer jobs and resources:
$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
We also have a documentation page which explains how users can benefit from backfill, so users have a certain incentive to specify a shorter wallclock limit if they can.
'sqos' is just an alias for
sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20
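For illustration, QOSs along those lines could be created with something like the following sacctmgr commands. This is only a sketch: it assumes the limits shown above are per-user (MaxJobsPerUser and friends), 'alice' is just a placeholder user, and for the QOS priority to have any effect PriorityWeightQOS needs to be non-zero in slurm.conf:

sacctmgr add qos name=hiprio   priority=100000 maxwall=03:00:00    maxjobsperuser=50   maxsubmitjobsperuser=100   maxtresperuser=cpu=128,gres/gpu=4
sacctmgr add qos name=prio     priority=1000   maxwall=3-00:00:00  maxjobsperuser=500  maxsubmitjobsperuser=1000  maxtresperuser=cpu=256,gres/gpu=8
sacctmgr add qos name=standard priority=0      maxwall=14-00:00:00 maxjobsperuser=2000 maxsubmitjobsperuser=10000 maxtresperuser=cpu=768,gres/gpu=16
sacctmgr modify user name=alice set qos+=hiprio,prio,standard defaultqos=standard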
Cheers,
Loris
Thanks Loris,
Am I right that, reading between the lines, you are saying: rather than going ahead with my "soft" limit idea, just use regular hard limits, be generous with the default, and provide user education instead? In fact, that is an alternative approach I am considering too.
Hi Davide,
Yes. In fact, we never get anyone complaining about jobs being cancelled for having reached their time limit even when other resources were idle. Having said that, we do occasionally extend the time limit for individual jobs when requested. We also don't preempt any jobs.
Apart from that, I imagine implementing your 'soft' limits robustly might be quite challenging and/or time-consuming, as I am not aware that Slurm has anything like that built in.
Cheers,
Loris
Hi Davide,
I think it should be possible to emulate this via preemption: if you set PreemptMode to CANCEL, a preempted job will behave just as if it reached the end of its wall time. Then, you can use PreemptExemptTime as your soft wall time limit -- the job will not be preempted before PreemptExemptTime has passed.
See https://slurm.schedmd.com/preempt.html
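A minimal slurm.conf sketch of what I mean (the preemption type and the two-hour value are just illustrative assumptions; PreemptExemptTime can also be set on individual QOSs, which takes precedence over the global value):

PreemptType=preempt/qos      # or preempt/partition_prio; the actual preemption rules live in the QOS/partition definitions
PreemptMode=CANCEL           # a preempted job is cancelled, as if it had reached its wall time
PreemptExemptTime=02:00:00   # the "soft" limit: no job is preempted during its first 2 hours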
Best,
A.
Hi Ansgar,
This is indeed what I was looking for: I was not aware of PreemptExemptTime.
From my cursory glance at the documentation, though, it seems that PreemptExemptTime is QOS-based rather than job-based. Is that correct? Or could it be set per job, perhaps in a prolog or job_submit Lua script? I am thinking that users could keep using the regular wallclock limit setting in Slurm, and the script could strip that and use it to set PreemptExemptTime instead.
Thanks, Davide
Yes, that's correct. I guess you could create a bunch of QOSs with different PreemptExemptTime values and then let the user select one (or indeed select it from Lua), but as far as I know, there is no way to set arbitrary per-job values.
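For illustration, the "select it from Lua" part could look roughly like this in job_submit.lua (the QOS names and time thresholds are invented, and the corresponding QOSs with matching PreemptExemptTime values would have to exist already):

-- Map the requested time limit to one of a few pre-created QOSs
-- whose PreemptExemptTime roughly matches it (names are invented).
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- time_limit is in minutes; nil / NO_VAL means the user did not set one
    if job_desc.time_limit ~= nil and job_desc.time_limit ~= slurm.NO_VAL then
        if job_desc.time_limit <= 180 then        -- up to 3 hours
            job_desc.qos = "soft_3h"
        elseif job_desc.time_limit <= 4320 then   -- up to 3 days
            job_desc.qos = "soft_3d"
        else
            job_desc.qos = "soft_14d"
        end
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end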
Best,
A.
Sounds good, thanks for confirming it. Let me sleep on the "too many QOSs" issue, or decide whether I should ditch this idea altogether. If I implement it, I'll post the details of how I did it in this conversation. Cheers