[slurm-users] Priority jobs interfering with predictive scheduling
Carl Ponder
cponder at nvidia.com
Wed Apr 12 22:52:18 UTC 2023
Our cluster has some nodes separated to their own partition for running
interactive sessions, which are required to be short and only use a few
nodes.
I've always disliked this approach because I see some of the interactive
nodes being idle while other jobs are waiting on the batch partition.
I'd proposed that "interactive" jobs ought to just draw from the regular
pool of nodes, set up as a QOS or another partition with these limits:
1. Only a few interactive jobs can run at a given time.
2. A single user can only have one interactive job running or queued.
3. Only a few nodes can be used by an interactive job.
4. The interactive jobs have higher priority than batch jobs.
Item #4 would give the user a more immediate startup. Not quite as good
as running from a separate pool of nodes, but I wouldn't expect the
wait-times to be long on a big enough cluster.
Here's a problem the Admins ran into when they tried this sort of thing:
A. The predictive scheduler knows the maximum time a large job has to
wait to gather all the nodes it needs, just by looking at the
time-limits on all the jobs still running.
B. If a higher-priority job comes in during this "gather" phase, though,
it will steal one of the idle nodes that were held for the big job.
C. Given that more nodes now need to be gathered, the predictive
scheduler will assign a different maximum wait-time to this job, and may
start a smaller job instead with the pool of nodes that have been
accumulated.
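
To make A through C concrete, here is a toy model of the estimate, not
Slurm's actual backfill code. It assumes every running job holds its
nodes until its time-limit expires (the worst case), and the function
name and the job tuples are made up for illustration:

```python
def worst_case_wait(needed, idle, running):
    """Upper bound (minutes) on the wait to gather `needed` nodes.

    `running` is a list of (remaining_minutes, nodes) pairs, where
    remaining_minutes is the time left before the job hits its limit.
    Returns None if the nodes can never be gathered.
    """
    if needed <= idle:
        return 0
    available = idle
    # Walk the running jobs in order of when they must finish at the latest.
    for remaining, nodes in sorted(running):
        available += nodes
        if available >= needed:
            return remaining
    return None


# Big job needs 8 nodes; 6 are idle and held for it.
before = worst_case_wait(8, 6, [(30, 2), (120, 8)])

# A 2-node, 60-minute interactive job takes two of the held idle nodes.
after = worst_case_wait(8, 4, [(30, 2), (60, 2), (120, 8)])

print(before, after)
```

In this example the bound moves from 30 to 60 minutes once the
interactive job takes two of the held nodes, which is the shift
described in C.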
The result is that the job-order can get perturbed quite a bit and a
large job could end up waiting longer than if the interactive jobs drew
from a separate pool of nodes.
Also, if it ends up running some smaller job first, then not all of the
gathered nodes needed to sit idle to begin with, and some node-hours
will have gone to waste.
Do any of you know a way to control this?
If the "interactive" jobs were limited to, say, 10 total, the predictive
scheduler could look at the time it would take to gather N+10 nodes
instead of N, in which case I think the schedule would behave more
deterministically.
There'd be a special case if (N+10) is more than the number of nodes on
the cluster, of course.
And you wouldn't really need to schedule for (N+10) nodes, it would be
(N+10-x) where "x" is the number of nodes currently being consumed by
interactive jobs.
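
As a sketch of that arithmetic (the function and parameter names are
hypothetical, and clamping to the cluster size is only one possible
way to handle the special case, not something Slurm does as far as I
know):

```python
def nodes_to_plan_for(n_job, interactive_cap, interactive_in_use, cluster_size):
    """Number of nodes the scheduler would plan to gather for a big job.

    n_job              -- N, the nodes the big job requests
    interactive_cap    -- total nodes interactive jobs may use (10 above)
    interactive_in_use -- x, nodes interactive jobs hold right now
    cluster_size       -- total nodes in the cluster
    """
    # Only the interactive nodes not yet claimed can still be taken.
    headroom = max(interactive_cap - interactive_in_use, 0)
    # Assumed handling of the special case: never plan for more nodes
    # than the cluster has.
    return min(n_job + headroom, cluster_size)


typical = nodes_to_plan_for(50, 10, 3, 100)   # N + 10 - x
clamped = nodes_to_plan_for(95, 10, 0, 100)   # N + 10 exceeds the cluster
print(typical, clamped)
```

With N=50, a cap of 10, and x=3, this plans for 57 nodes; with N=95 on
a 100-node cluster it is clamped to 100.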