[slurm-users] Backfill pushing jobs back

David Baker D.J.Baker at soton.ac.uk
Wed Dec 9 11:35:20 UTC 2020


Hello,


We are seeing an issue where smaller jobs push back larger jobs. We are using Slurm 19.05.8, so I am not sure whether this has been patched in newer releases. With a 4-node test partition, I submit 3 jobs as 2 users:



ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'

ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'



Then I increase the priority of the pending jobs significantly. From reading the manual, my understanding is that nodes should then be held for these jobs.

for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=1000000000;done



squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE

28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING

28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING

28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING



So there is one node free in our 4-node partition. Naturally, a small job with a walltime of less than 1 hour could run on that node, but we are also seeing backfill start longer jobs there.



backfilltest    up 2-12:00:00      3  alloc reddev[001-003]

backfilltest    up 2-12:00:00      1   idle reddev004





ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'





squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"

JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE

28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING

28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING

28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING

28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING
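
To make the concern concrete, here is a rough sketch of the arithmetic behind the listing above. It assumes both running jobs use their full time limits and that they started on the same day, so time-of-day arithmetic is enough. The helper name to_sec is made up for illustration and is not a Slurm command.

```shell
# Seconds since midnight for an HH:MM:SS string; awk reads "09" as decimal 9.
to_sec() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Worst-case end = start time + time limit (both jobs started on 2020-12-08).
big_end=$(( $(to_sec 09:44:15) + 120 * 60 ))    # job 28481: 3 nodes, limit 2:00:00
fill_end=$(( $(to_sec 09:54:48) + 720 * 60 ))   # job 28484: 1 node, limit 12:00:00

# The 4-node jobs need every node, so they cannot start before the later end time.
if [ "$fill_end" -gt "$big_end" ]; then
    delay_min=$(( (fill_end - big_end) / 60 ))
    echo "job 28484 may hold reddev004 for ${delay_min} minutes after job 28481 ends"
fi
```

In other words, if both jobs run to their limits, the backfilled 1-node job keeps the high-priority 4-node jobs waiting for roughly ten more hours after the 3-node job has finished.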



Is this expected behaviour? It is also odd that the pending jobs do not have a start time. I have increased the backfill parameters significantly, but that does not seem to affect this at all.
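
For anyone wanting to check for the same symptom, a small filter over the pipe-delimited squeue format used above can list the pending jobs that have no start time. This is only a sketch: pending_without_start is a hypothetical helper, and it assumes the seven-column layout "%i | %u | %C | %Q | %l | %S | %T" shown in this message.

```shell
# Read squeue output in the 7-column "a | b | ..." layout from stdin and
# print the job IDs that are PENDING with a start time of N/A.
pending_without_start() {
    awk -F'|' '{
        for (i = 1; i <= NF; i++) gsub(/^ +| +$/, "", $i)   # trim padding around each field
        if ($7 == "PENDING" && $6 == "N/A") print $1
    }'
}

# Intended use on a live system (not run here):
#   squeue -h -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T" | pending_without_start
```

The squeue --start option is another way to list the expected start times of pending jobs.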



SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60
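
As a sanity check on the units, here is a short sketch that converts the values above. It relies on my reading of the slurm.conf man page that bf_window is given in minutes and bf_resolution in seconds. The helper get_param is made up for illustration.

```shell
params='bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60'

# Print the value of one key from a comma-separated key=value list.
get_param() {
    echo "$2" | tr ',' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

window_days=$(( $(get_param bf_window "$params") / 1440 ))       # minutes -> days
resolution_min=$(( $(get_param bf_resolution "$params") / 60 ))  # seconds -> minutes
echo "bf_window covers ${window_days} days; bf_resolution is ${resolution_min} minutes"
```

If those units are right, the planning window is 10 days, which comfortably covers every time limit in this test, so as far as I can tell the window itself should not be what hides the start times.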


Best regards,

David
