[slurm-users] Backfill advice
D.J.Baker at soton.ac.uk
Mon Mar 25 12:07:01 UTC 2019
Thank you for your detailed reply regarding how to set up backfill. There's quite a lot to take in there. Fortunately, I have a day or two to read up and understand the ideas, since our cluster is down due to a water cooling failure. In the first instance, I'll certainly implement bf_continue and review/amend the "bf_maxjobs" and "bf_interval" parameters. Switching on backfill debugging sounds very useful, but does that setting tend to bloat the logs if left enabled for long periods?
We did have a contract with SchedMD, which recently finished. In one of our last discussions it was intimated that we may have hit a bug, namely that backfilled jobs were potentially stealing nodes intended for higher priority jobs (bug 5297). The advice was to consider upgrading to Slurm 18.08.4 and to implement bf_ignore_newly_avail_nodes. I was interested to see that you had a similar discussion with SchedMD and did upgrade. I think I ought to update the bf configuration as described in my first paragraph and see how that goes before we bite the bullet and do the upgrade (we are at 18.08.0 currently).
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Douglas Jacobsen <dmjacobsen at lbl.gov>
Sent: 23 March 2019 13:30
To: Slurm User Community List
Subject: Re: [slurm-users] Backfill advice
At first blush bf_continue and bf_interval as well as bf_maxjobs (if I remembered the parameter correctly) are critical first steps in tuning. Setting DebugFlags=backfill is essential to getting the needed data to make tuning decisions.
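As a rough illustration of the settings named above, a slurm.conf fragment might look like the following. The numeric values are placeholders rather than recommendations, and bf_max_job_test and bf_max_job_user are an assumption about what "bf_maxjobs" refers to.

```
# slurm.conf (illustrative sketch; values are placeholders, not recommendations)
SchedulerType=sched/backfill
# bf_continue: carry on through the queue after the scheduler releases its locks
# bf_interval: seconds between backfill iterations
# bf_max_job_test / bf_max_job_user: assumed to be what "bf_maxjobs" refers to
SchedulerParameters=bf_continue,bf_interval=60,bf_max_job_test=1000,bf_max_job_user=20
# Log backfill decisions; this can also be toggled at runtime with
#   scontrol setdebugflags +backfill   (and -backfill to turn it off again)
DebugFlags=Backfill
```

Toggling the flag at runtime is one way to collect backfill data for a limited period without leaving the extra logging on permanently.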
Use of per user/account settings if they are too low can also cause starvation depending on the way your priority calculation is set up.
I presented these slides a few years ago at the Slurm user group on this topic:
The key thing to keep in mind with large jobs is that slurm needs to evaluate them again and again in the same order or the scheduled time may drift. Thus it is important that once jobs are getting planning reservations they must continue to do so.
Because of the prevalence of large jobs at our site we use bf_min_prio_reserve, which splits the priority space into a reserving and a non-reserving set, and then use job age to allow jobs to age from the non-reserving portion of the priority space into the reserving portion. Use of the recent MaxJobsAccruePerUser limit on a job QOS can throttle the rate of jobs aging and prevent negative effects from users submitting large numbers of jobs.
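A minimal sketch of that split, assuming the documented parameter spelling bf_min_prio_reserve; the threshold and weight below are hypothetical and would need to match the site's own priority scale:

```
# slurm.conf (illustrative sketch; the threshold and weight are hypothetical)
# Only jobs with priority at or above this value receive a backfill reservation
SchedulerParameters=bf_continue,bf_min_prio_reserve=1000000
# The age factor lets waiting jobs accrue priority towards that threshold
PriorityType=priority/multifactor
PriorityWeightAge=500000
```

MaxJobsAccruePerUser is a QOS limit, so it would be set through sacctmgr on the relevant QOS rather than in slurm.conf.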
I realize that is a large number of tunables and concepts densely packed, but it should give you some reasonable starting points.
On Sat, Mar 23, 2019 at 05:26 david baker <djbaker12 at gmail.com> wrote:
We do have large jobs getting starved out on our cluster, and I note particularly that we never manage to see a job getting assigned a start time. It seems very possible that backfilled jobs are stealing nodes reserved for large/higher priority jobs.
I'm wondering if our backfill configuration has any bearing on this issue or whether we are unfortunate enough to have hit a bug. One parameter that is missing in our bf setup is "bf_continue". Is that parameter significant in terms of ensuring that bf drills down sufficiently in the job mix? Also we are using the default bf frequency -- should we really reduce the frequency and potentially reduce the number of bf jobs per group/user or in total at each iteration? Currently, I think we are setting the per-user limit to 20.
Any thoughts would be appreciated, please.