[slurm-users] Backfill fails to start jobs (when preemptable QOS is involved)
Yair Yarom
irush at cs.huji.ac.il
Sun Nov 15 14:14:15 UTC 2020
Hi list,
We have GrpTRES limits on all accounts, which causes many higher-priority
jobs to sit in the queue blocked by those limits, so we rely heavily on the
backfill scheduler. We also have a special lower-priority, preemptable QOS
with no limits.
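Roughly, the setup looks like this (the account and QOS names and the limit
values below are illustrative only, not our real configuration):

# slurm.conf: QOS-based preemption; preempted jobs are killed rather than requeued
PreemptType=preempt/qos
PreemptMode=CANCEL

# per-account TRES limits (illustrative numbers)
sacctmgr modify account some_account set GrpTRES=cpu=128,gres/gpu=8

# low-priority preemptable QOS with no limits, and a normal QOS allowed to preempt it
sacctmgr add qos preemptable Priority=10 PreemptMode=cancel
sacctmgr add qos normal Priority=100 Preempt=preemptable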
We've noticed that when the cluster is loaded, submitting a non-preemptable
but not highest-priority job causes the backfill algorithm to fail to start
that job when it needs to kill preemptable jobs: the preemptable jobs are
killed, but the job doesn't start.
From the logs, for job 3617065:
[2020-11-15T13:36:01.928] backfill test for JobId=3617065 Prio=680634 Partition=short
[2020-11-15T13:36:12.947] _preempt_jobs: preempted JobId=3616258 had to be killed
[2020-11-15T13:36:12.953] _preempt_jobs: preempted JobId=3616259 had to be killed
[2020-11-15T13:36:12.960] _preempt_jobs: preempted JobId=3616255 had to be killed
[2020-11-15T13:36:12.966] _preempt_jobs: preempted JobId=3616256 had to be killed
[2020-11-15T13:36:12.972] _preempt_jobs: preempted JobId=3616257 had to be killed
[2020-11-15T13:36:12.973] backfill: planned start of JobId=3617065 failed: Requested nodes are busy
[2020-11-15T13:36:12.973] JobId=3617065 to start at 2020-11-15T13:36:01, end at 2020-11-15T15:36:00 on nodes dumfries-002 in partition short
Looking at job 3616258, which was indeed preempted on time:
$ sacct -j 3616258 -ojobid,end,state
       JobID                 End      State
------------ ------------------- ----------
3616258      2020-11-15T13:36:12  PREEMPTED
3616258.bat+ 2020-11-15T13:36:50  CANCELLED
3616258.ext+ 2020-11-15T13:36:13  COMPLETED
The job was preempted at 13:36:12, but its batch script only finished at
13:36:50. By then the backfill scheduler had already given up. The job does
start in one of the subsequent backfill cycles, but in some cases that can
take more than 30 minutes.
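(The backfill cycle timing can be checked with sdiag; the exact field names
may differ slightly between Slurm versions, e.g.:

# show the backfill scheduler statistics (last/mean cycle time, last cycle start)
sdiag | grep -A 20 'Backfilling stats'
)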
Is this intentional, i.e. that backfill preempts jobs on one cycle and only
starts the "real" job on a second (or later) cycle?
Has anyone else encountered this?
Our Slurm is 19.05.1, with KillWait=30 (we want to keep this above 0),
CompleteWait=0, and SchedulerParameters (which we have changed numerous times
in the past weeks) currently includes the following (the combined slurm.conf
form is reproduced after the list):
batch_sched_delay=5
bf_busy_nodes
bf_continue
bf_interval=90
bf_max_job_test=2500
bf_max_job_user_part=30
bf_max_time=270
bf_min_prio_reserve=1000000
bf_window=30300
bf_yield_interval=5000000
default_queue_depth=2000
defer
kill_invalid_depend
max_rpc_cnt=150
preempt_strict_order
sched_interval=120
sched_min_interval=1000000
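In slurm.conf that amounts to roughly the following (SchedulerParameters is a
single comma-separated line; it only wraps here because of the mail):

KillWait=30
CompleteWait=0
SchedulerParameters=batch_sched_delay=5,bf_busy_nodes,bf_continue,bf_interval=90,bf_max_job_test=2500,bf_max_job_user_part=30,bf_max_time=270,bf_min_prio_reserve=1000000,bf_window=30300,bf_yield_interval=5000000,default_queue_depth=2000,defer,kill_invalid_depend,max_rpc_cnt=150,preempt_strict_order,sched_interval=120,sched_min_interval=1000000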
Thanks in advance,
Yair.