This conversation is drifting a bit away from my initial questions and covering various other related topics. In fact, I agree with almost everything written in the last few messages. However, that is somewhat orthogonal to my initial request, which I now understand has the answer "not possible with slurm configuration, possible with ugly hacks that are probably error-prone and not worth the hassle". Just for the sake of the discussion (since I'm enjoying hearing the various perspectives), I'll restate my request and why I think slurm does not support this need.
Most clusters have very high utilization all the time. This is good for ROI etc. but annoying to users. Forcing users to specify a firm wallclock limit helps slurm make good scheduling decisions, which keep utilization (and ROI) high and minimize wait time for everybody.
At the place where I work the situation is quite different: there are moments of high pressure and long waits, and there are moments when utilization drops under 50% and sometimes even under 25% (e.g. during long weekends). We can have a discussion about it, but the bottom line is that management is fine with it (ROI and all), so that's the way it is. This circumstance, I agree, is quite peculiar and not shared by any other place I have worked or ever had an account at, but it is what it is. In this circumstance it feels at least silly, and perhaps even extremely wasteful and annoying, to let slurm cancel jobs at their wallclock limit without considering any other context. I mean, imagine a user with a week-long job who estimated a 7-day runtime and "for good measure" requested an 8-day limit, but whose job would actually take 9 days. Imagine that the 8th day fell in the middle of a long weekend when utilization was 25% and there was not a single other job pending. Maybe this job is a one-off experiment quickly cobbled together to test one thing, so it's not a well-designed piece of code and has no checkpoint-restart capability. Why enforce the wallclock limit in that situation?
The way around this problem in the past was simply not to make the wallclock limit mandatory (a decision made by my predecessor, who has since left). That worked only because the cluster was not in very good shape usability-wise, so most people avoided it anyway; there was seldom a long line of jobs pending in the queue, and slurm did not need to work very hard to schedule things. Now that I've improved the usability situation, utilization has become much higher and this has become a problem. Perhaps before long people will learn to plan ahead, submit more jobs, and fill the machine up during the weekends too (I'm working on user education towards that), and if that happens the dilemma above will go away. But for now I have it.
I'm still mulling over how best to proceed. Maybe I'll just force the users to set a wallclock limit and live with it.
Here is another idea that just came to me. Does slurm have a "global" switch to turn on/off the cancelling of jobs that hit their wallclock limit? If so, I could have a cron job check whether there are pending jobs in the queue: if not, shut enforcement off; if so, turn it back on. Granted, that may be sloppy (e.g. a single job pending for one resource would cause the cancelling of jobs using entirely different resources), but it's something, and it would be easy to implement compared to toggling pre-emption on/off as discussed in a previous message.
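In case it helps make the idea concrete, here is a rough, untested sketch of what that cron piece might look like. It assumes the OverTimeLimit parameter in slurm.conf is the closest thing to such a global switch (I believe it accepts UNLIMITED as well as a number of minutes), that the config file lives at /etc/slurm/slurm.conf (paths vary by site), and that the script runs as root or SlurmUser on the controller host so that rewriting the file and calling "scontrol reconfigure" actually takes effect:

#!/usr/bin/env python3
# Hypothetical cron-driven toggle: relax wallclock enforcement when nothing is
# pending, re-enable it when jobs are waiting. Assumes OverTimeLimit in
# slurm.conf is the mechanism and that rewriting slurm.conf followed by
# "scontrol reconfigure" is acceptable at this site. Untested sketch.

import re
import subprocess
from pathlib import Path

SLURM_CONF = Path("/etc/slurm/slurm.conf")  # adjust to the site's actual path

def pending_jobs_exist() -> bool:
    """Return True if squeue reports any job in the PENDING state."""
    out = subprocess.run(
        ["squeue", "-t", "PD", "-h", "-o", "%i"],
        capture_output=True, text=True, check=True,
    )
    return bool(out.stdout.strip())

def set_overtimelimit(value: str) -> None:
    """Rewrite the OverTimeLimit line in slurm.conf and ask slurmctld to reread it."""
    text = SLURM_CONF.read_text()
    new_text, n = re.subn(
        r"(?im)^\s*OverTimeLimit\s*=.*$", f"OverTimeLimit={value}", text
    )
    if n == 0:
        # Parameter not present yet: append it at the end of the file.
        new_text = text.rstrip("\n") + f"\nOverTimeLimit={value}\n"
    if new_text != text:
        SLURM_CONF.write_text(new_text)
        subprocess.run(["scontrol", "reconfigure"], check=True)

if __name__ == "__main__":
    # With pending jobs, enforce limits normally (no grace); with an empty
    # queue, let running jobs overrun their limit indefinitely.
    set_overtimelimit("0" if pending_jobs_exist() else "UNLIMITED")

Something like that could run from the controller's root crontab every 10-15 minutes. Of course it inherits exactly the coarseness I mentioned above: one job pending anywhere re-enables enforcement for everything.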
Great conversation, folks; I'm enjoying reading the various perspectives from different sites!