[slurm-users] Priority jobs interfering with predictive scheduling

Wed Apr 12 22:52:18 UTC 2023

Our cluster has some nodes set aside in their own partition for running 
interactive sessions, which are required to be short and to use only a 
few nodes.
I've always disliked this approach because I see some of the interactive 
nodes sitting idle while other jobs are waiting on the batch partition.

I'd proposed that the "interactive" jobs ought to just draw from the 
regular pool of nodes, with the limits expressed as a QOS or another 
partition, as follows:

1. Only a few interactive jobs can run at a given time.
2. A single user can only have one interactive job running or queued.
3. Only a few nodes can be used by an interactive job.
4. The interactive jobs have higher priority than batch jobs.
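
To make the four rules concrete, here is a minimal Python sketch of 
them as plain admission checks. It is not Slurm code: every name and 
number in it (InteractiveLimits, may_submit, may_start, the default 
values) is hypothetical and chosen only for illustration. In Slurm 
itself these rules would presumably be expressed as QOS limits rather 
than code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InteractiveLimits:
    # All values are made-up placeholders, not numbers from a real cluster.
    max_running_jobs: int = 10   # rule 1: few interactive jobs at a time
    max_jobs_per_user: int = 1   # rule 2: one running or queued job per user
    max_nodes_per_job: int = 2   # rule 3: few nodes per interactive job
    priority_boost: int = 1000   # rule 4: interactive outranks batch


def may_submit(limits, user, nodes_requested, active_jobs_by_user):
    """Rules 2 and 3: checks applied when the job is submitted."""
    if nodes_requested > limits.max_nodes_per_job:
        return False
    # active_jobs_by_user counts running plus queued interactive jobs.
    return active_jobs_by_user.get(user, 0) < limits.max_jobs_per_user


def may_start(limits, running_interactive_jobs):
    """Rule 1: keep the job queued while the global cap is reached."""
    return running_interactive_jobs < limits.max_running_jobs


def effective_priority(limits, batch_priority):
    """Rule 4: an interactive job sorts ahead of an equivalent batch job."""
    return batch_priority + limits.priority_boost
```

Rules 2 and 3 reject a job outright, while rule 1 only delays it, which 
is why the sketch keeps them in separate functions.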

Item #4 would give the user a more immediate startup. Not quite as good 
as running from a separate pool of nodes, but I wouldn't expect the 
wait-times to be long on a big enough cluster.
Here's a problem the Admins ran into when they tried this sort of thing:

A. The predictive scheduler knows the maximum time a large job has to 
wait to gather all the nodes it needs, just by looking at the 
time-limits on all the jobs still running.
B. If a higher-priority job comes in during this "gather" phase, though, 
it will steal one of the idle nodes that were held for the big job.
C. Given that more nodes now need to be gathered, the predictive 
scheduler will assign a different maximum wait-time to this job, and may 
start a smaller job instead with the pool of nodes that have been 
accumulated.
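
Steps A through C can be illustrated with a toy calculation. This is 
only a sketch under simplifying assumptions, not how Slurm's scheduler 
is actually implemented: every running job occupies exactly one node, 
every job runs to its full time limit, and the function name 
worst_case_start is made up for this example.

```python
def worst_case_start(needed, idle, running_end_times):
    """Latest time by which `needed` nodes are guaranteed to be free.

    `idle` is the number of nodes already held for the big job.
    `running_end_times` has one entry per busy node: the time at which
    that node's job hits its time limit (assumed one node per job).
    """
    missing = needed - idle
    if missing <= 0:
        return 0  # enough nodes are already idle
    ends = sorted(running_end_times)
    if missing > len(ends):
        raise ValueError("not enough nodes on the cluster")
    # The job can start once the `missing`-th busy node is released.
    return ends[missing - 1]


# Step A: the big job needs 8 nodes, 5 are held idle, and three busy
# nodes free up at hours 1, 2 and 3.
before = worst_case_start(8, 5, [1, 2, 3])

# Step B: an interactive job with a 4-hour limit takes one idle node.
after = worst_case_start(8, 4, [1, 2, 3, 4])

# Step C: the bound moved from hour 3 to hour 4, so the plan changes.
print(before, after)
```

With these numbers the bound slips from hour 3 to hour 4. Whether a 
real scheduler then starts a smaller job in the gap depends on its 
backfill rules, which this sketch does not model.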

The result is that the job-order can get perturbed quite a bit and a 
large job could end up waiting longer than if the interactive jobs drew 
from a separate pool of nodes.
Also, if it ends up running some smaller job first, then some of the 
gathered nodes never needed to sit idle in the first place, and those 
node-hours have gone to waste.
Do any of you know a way to control this?

If the "interactive" jobs were limited to, say, 10 total, the predictive 
scheduler could look at the time it would take to gather N+10 nodes 
instead of N, in which case I think the schedule would behave more 
deterministically.
There'd be a special case if (N+10) is more than the number of nodes on 
the cluster, of course.
And you wouldn't really need to schedule for (N+10) nodes; it would be 
(N+10-x), where "x" is the number of nodes currently being consumed by 
interactive jobs.
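
The arithmetic above can be written out as a small Python sketch. It 
assumes the interactive limit is counted in nodes (10 in the example), 
and it handles the special case by simply capping the target at the 
cluster size, which is one possible choice and not something settled in 
this thread. The function and parameter names are hypothetical.

```python
def nodes_to_plan_for(n, interactive_cap, interactive_in_use, cluster_nodes):
    """Number of nodes to gather so interactive jobs cannot disturb the plan.

    n                   nodes the large job actually needs (N)
    interactive_cap     maximum nodes interactive jobs may hold (the "10")
    interactive_in_use  nodes interactive jobs hold right now ("x")
    cluster_nodes       total nodes on the cluster
    """
    if n > cluster_nodes:
        raise ValueError("job is larger than the cluster")
    # Clamp x in case the accounting is momentarily out of range.
    x = max(0, min(interactive_in_use, interactive_cap))
    headroom = interactive_cap - x  # nodes interactive jobs could still take
    # Special case: N+10-x may exceed the cluster, so cap it (an assumption).
    return min(n + headroom, cluster_nodes)


# 50-node job, cap of 10, 3 interactive nodes in use: plan for 57 nodes.
print(nodes_to_plan_for(50, 10, 3, 100))

# 95-node job on a 100-node cluster: capped at 100 instead of 105.
print(nodes_to_plan_for(95, 10, 0, 100))
```

In the capped case the headroom is smaller than the interactive limit, 
so a late interactive job could still delay the large job there.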