"Optimal" slurm configuration - slurm-users

23 Feb 2024


Hello!
In our current cluster the workflows are quite diverse: a number of large,
long (24-72h) jobs, medium-size jobs (<4h), and many small 1 node jobs. The
current priority is based on fair share only (averaged over a timescale of
a few months).
For the new setup we would like to
(1) discourage the 1 node jobs [especially the long ones]
(2) have fair use of the machine for the different cases. In particular,
we found that the ratio of waiting time to runtime is very large for the
~1-4h jobs.
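To make the metric in (2) concrete, here is a minimal Python sketch of how
such a ratio could be computed per job class. The job records, the class
names, and the class boundaries in job_class() are assumptions made up for
illustration; they are not measurements from our cluster.

```python
from collections import defaultdict
from statistics import median

# Hypothetical job records: (nodes, wait_hours, run_hours).
# The values are invented for illustration only.
JOBS = [
    (64, 10.0, 48.0),
    (16, 6.0, 2.0),
    (8, 9.0, 3.0),
    (1, 1.0, 30.0),
    (1, 0.5, 0.5),
]


def job_class(nodes, run_hours):
    """Assign a job to one of three assumed classes."""
    if nodes == 1:
        return "single-node"
    if run_hours < 4:
        return "medium (<4h)"
    return "large/long"


def median_wait_ratio(jobs):
    """Return the median of wait_hours / run_hours for each job class."""
    ratios = defaultdict(list)
    for nodes, wait_h, run_h in jobs:
        ratios[job_class(nodes, run_h)].append(wait_h / run_h)
    return {cls: median(vals) for cls, vals in ratios.items()}


if __name__ == "__main__":
    for cls, ratio in median_wait_ratio(JOBS).items():
        print(f"{cls}: median wait/runtime = {ratio:.2f}")
```

A ratio well above 1 for one class means those jobs spend more time
queued than running.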
For this reason, I have been trying to read up on strategies for
partition / priority policies, but could not find any good documentation on
this (https://slurm.schedmd.com/documentation.html is more
focused on the technical aspects).
Does anyone know any good references for this?
In particular, we were thinking about
(a) somehow having a long-term as well as a shorter-term fairshare count,
i.e., lowering the priority if a user already has something running
or
(b) introducing a "small" partition for the <4h jobs with higher priority,
but we're unsure whether this would block all the larger jobs from running...
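To illustrate what we mean by (a), here is a minimal Python sketch of a
priority factor built from two decayed usage counts, one with a short
half-life and one with a long half-life. This is a toy model, not Slurm's
actual priority calculation: the function names, the half-lives (1 day and
60 days), the 50/50 weighting, and the exponential penalty are all
assumptions chosen for illustration.

```python
import math


def decayed_usage(events, now_h, half_life_h):
    """Sum node-hours, halving each contribution every half_life_h hours.

    events is a list of (time_h, node_hours) tuples (hypothetical format).
    """
    return sum(nh * 0.5 ** ((now_h - t) / half_life_h) for t, nh in events)


def effective_nodes(events, now_h, half_life_h):
    """Rescale decayed usage to an average node count over the decay window.

    The decay weight integrates to half_life_h / ln(2), so dividing by it
    makes the short-term and long-term counts comparable.
    """
    return decayed_usage(events, now_h, half_life_h) * math.log(2) / half_life_h


def priority_factor(events, now_h, share_nodes,
                    short_hl_h=24.0, long_hl_h=24.0 * 60, short_weight=0.5):
    """Return a factor in (0, 1]; 1.0 means no recorded usage."""
    short = effective_nodes(events, now_h, short_hl_h)
    long_ = effective_nodes(events, now_h, long_hl_h)
    combined = short_weight * short + (1.0 - short_weight) * long_
    # Assumed penalty shape: the factor halves each time the combined
    # usage grows by the user's share (share_nodes).
    return 2.0 ** (-combined / share_nodes)


if __name__ == "__main__":
    now = 1000.0
    recent_user = [(now - 1.0, 100.0)]      # 100 node-hours, 1 hour ago
    old_user = [(now - 720.0, 100.0)]       # 100 node-hours, 30 days ago
    print(priority_factor(recent_user, now, share_nodes=4.0))
    print(priority_factor(old_user, now, share_nodes=4.0))
```

In this model the user with recent usage gets the lower factor, while the
long half-life still remembers heavy usage from weeks ago.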
Any help or reading material is highly appreciated.
Many thanks,
Max