[slurm-users] Backfill pushing jobs back
David Baker
D.J.Baker at soton.ac.uk
Wed Dec 9 11:35:20 UTC 2020
Hello,
We are seeing an issue where smaller jobs push back large jobs. We are running Slurm 19.05.8, so I am not sure whether this has been fixed in newer releases. On a 4-node test partition I submit 3 jobs as 2 users:
ssh hpcdev1@navy51 'sbatch --nodes=3 --ntasks-per-node=40 --partition=backfilltest --time=120 --wrap="sleep 7200"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'
ssh hpcdev2@navy51 'sbatch --nodes=4 --ntasks-per-node=40 --partition=backfilltest --time=60 --wrap="sleep 3600"'
Then I increase the priority of the pending jobs significantly. From reading the manual, my understanding is that nodes should be held for these jobs.
for job in $(squeue -h -p backfilltest -t pd -o %i); do scontrol update job ${job} priority=1000000000;done
squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING
So there is one node free in our 4-node partition. Naturally, a small job with a walltime of less than 1 hour could run on it, but we are also seeing backfill start longer jobs.
backfilltest up 2-12:00:00 3 alloc reddev[001-003]
backfilltest up 2-12:00:00 1 idle reddev004
ssh hpcdev3@navy51 'sbatch --nodes=1 --ntasks-per-node=40 --partition=backfilltest --time=720 --wrap="sleep 432000"'
squeue -p backfilltest -o "%i | %u | %C | %Q | %l | %S | %T"
JOBID | USER | CPUS | PRIORITY | TIME_LIMIT | START_TIME | STATE
28482 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28483 | hpcdev2 | 160 | 1000000000 | 1:00:00 | N/A | PENDING
28481 | hpcdev1 | 120 | 50083 | 2:00:00 | 2020-12-08T09:44:15 | RUNNING
28484 | hpcdev3 | 40 | 37541 | 12:00:00 | 2020-12-08T09:54:48 | RUNNING
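To put a number on the effect, here is a rough sketch of the arithmetic. It assumes each job runs to its full time limit and that the 4-node jobs cannot start until every node in the partition is free. The variable names are only for illustration; the times are taken from the squeue output above.

```shell
# Start times from the squeue output, as seconds since midnight on 2020-12-08.
start_28481=$((9*3600 + 44*60 + 15))   # 09:44:15
start_28484=$((9*3600 + 54*60 + 48))   # 09:54:48

# Time limits in seconds (sbatch --time is given in minutes).
limit_28481=$((120*60))   # --time=120
limit_28484=$((720*60))   # --time=720

# Latest possible end of each job, assuming it uses its whole time limit.
end_28481=$((start_28481 + limit_28481))
end_28484=$((start_28484 + limit_28484))

# Extra time the 4-node jobs must wait because job 28484 was started.
delay=$((end_28484 - end_28481))
printf 'extra wait for the 4-node jobs: %dh %dm %ds\n' \
    $((delay / 3600)) $((delay % 3600 / 60)) $((delay % 60))
```

Under those assumptions the 1-node job pushes the earliest possible start of jobs 28482 and 28483 from 11:44:15 to 21:54:48.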
Is this expected behaviour? It is also odd that the pending jobs don't have a start time. I have increased the backfill parameters significantly, but that doesn't seem to affect this at all.
SchedulerParameters=bf_window=14400,bf_resolution=2400,bf_max_job_user=80,bf_continue,default_queue_depth=1000,bf_interval=60
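For reference, a quick sanity check of the units in those settings. As I read the slurm.conf man page, bf_window is in minutes while bf_resolution and bf_interval are in seconds; the variable names below are only for illustration.

```shell
# Values from the SchedulerParameters line above.
bf_window_min=14400       # bf_window, in minutes
bf_resolution_sec=2400    # bf_resolution, in seconds
bf_interval_sec=60        # bf_interval, in seconds

# Convert to more readable units.
window_days=$((bf_window_min / 1440))
resolution_min=$((bf_resolution_sec / 60))

# Partition time limit 2-12:00:00 (2 days 12 hours), in minutes.
partition_limit_min=$((2*1440 + 12*60))

printf 'bf_window     = %d days\n' "$window_days"
printf 'bf_resolution = %d minutes\n' "$resolution_min"
printf 'bf_interval   = %d seconds\n' "$bf_interval_sec"
printf 'partition max = %d minutes\n' "$partition_limit_min"
```

So the backfill window (10 days) is well beyond the partition's 2-12:00:00 limit and the 12-hour job in the test.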
Best regards,
David