[slurm-users] Backfill pushing jobs back
D.J.Baker at soton.ac.uk
Mon Jan 4 14:57:22 UTC 2021
Last year I posted on this forum looking for some help on backfill in Slurm. We are currently using Slurm 19.05.8 and we find that backfilled (smaller) jobs tend to push back large jobs in our cluster. Chris Samuel replied to our post with the following response...
> This sounds like a problem that we had at NERSC (small jobs pushing back multi-thousand node jobs), and we carried a local patch for which Doug
> managed to get upstreamed in 20.02.x (I think it landed in 20.02.3, but 20.02.6 is the current version).
We looked through the release notes and, sure enough, there is a reference to a job starvation fix in 20.02.2, although I'm not sure that it is the relevant patch:
> -- Fix scheduling issue when there are not enough nodes available to run a job
> resulting in possible job starvation.
We decided to download and install the latest production version of Slurm, 20.11.2. One of my team members managed the installation and ran his backfill tests, only to find that the above backfill issue was still present. Should we wind back to version 20.02.6 and install/test that instead? Could someone please advise us? It would seem odd that a recent version of Slurm would still have a backfill issue that starves larger jobs. We're wondering if we have forgotten to configure something very fundamental, for example.