[slurm-users] Using Nice to Break Ties
Paul Edmon
pedmon at cfa.harvard.edu
Tue Sep 14 14:32:47 UTC 2021
We use the classic fairshare algorithm here with users having their
shares set to parent and pulling from the group pool rather than
having each user have their own fairshare (you can see our doc here:
https://docs.rc.fas.harvard.edu/kb/fairshare/). This has worked very
well for us for many years. However, there is one use case where it
does not work: breaking ties within a group. We have a lot of private
partitions, each owned by a specific group, and when a bunch of users
from that group submit jobs the queue turns into FIFO instead of letting
lower-usage users go first, because the parent flag gives every user in
the group the same fairshare. This is obviously solved by giving every
user their own fairshare, but that has the downside of hurting the
users' priority on the shared partitions they compete for with other
groups: there they can no longer use their group's full fairshare and
are stuck with their own. Their total group fairshare may be something
like 0.4 while their personal fairshare is stuck at 0 because they are
one of the heaviest users in the lab.
Now I get the feeling that Fair Tree might solve this, but I can't move
to it, as it has taken years for our users to even understand and accept
the classic fairshare model. As such, I'm trying to come up with
solutions that work within that model. One option I have been
considering is using the job_submit.lua script to set a nice value on
every job based on that user's usage. Basically, the nice value would
break ties inside the group and allow non-FIFO scheduling within an
account without impacting its overall fairshare relative to other
groups.
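To make the idea concrete, here is a rough Python sketch of the mapping I
have in mind. It is not an actual job_submit.lua plugin and it does not
talk to Slurm. The function name, the max_nice cap of 1000 and the
per-user usage numbers are all made up for illustration. The only Slurm
behaviour it relies on is that a larger nice value lowers a job's
priority.

```python
def nice_from_usage(user_usage, group_usage, max_nice=1000):
    """Map a user's fraction of the group's usage to a nice value.

    Hypothetical helper: max_nice is an assumed cap, not a Slurm default.
    Returns an integer in [0, max_nice]; heavier users get a larger value.
    """
    if group_usage <= 0:
        # No recorded usage in the group: nothing to break ties on.
        return 0
    # Clamp the fraction so bad accounting data cannot exceed the cap.
    fraction = min(max(user_usage / group_usage, 0.0), 1.0)
    return round(fraction * max_nice)


# Made-up usage for three users in the same account.
usage = {"alice": 900.0, "bob": 90.0, "carol": 10.0}
total = sum(usage.values())

nice = {user: nice_from_usage(used, total) for user, used in usage.items()}

# Lowest nice first, i.e. the lightest user is favoured.
for user in sorted(nice, key=nice.get):
    print(user, nice[user])
```

The cap would need to be small compared with the priority weights in use,
so that the nice value only reorders jobs that are otherwise tied and
does not swamp the group's fairshare on the shared partitions.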
Before I start messing around with this, though, I wanted to ping the
wisdom of the group and see how others handle tie breaking internal to
an account/group/lab. What solutions have people used for this?
-Paul Edmon-