Hello!
The workflows on our current cluster are quite diverse: a number of large, long (24-72h) jobs, medium-sized jobs of under 4h, and many small single-node jobs. Priority is currently based on fair share only (averaged over a timescale of roughly a few months).
For the new setup we would like to
(1) discourage single-node jobs [especially the long ones]
(2) make use of the machine fair across the different job types. In particular, we found that the ratio of waiting time to runtime is very large for the ~1-4h jobs.
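To be precise about the metric in (2): for each job we take the waiting time (submission to start) divided by the runtime (start to end) and compare it across job classes. Below is a minimal sketch of that calculation. The `Job` record, the class thresholds, and the sample numbers are made up for illustration and are not taken from our accounting data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Job:
    nodes: int
    wait_h: float     # hours between submission and start
    runtime_h: float  # hours between start and end


def job_class(job: Job) -> str:
    """Bucket a job roughly like the workload classes above (thresholds are illustrative)."""
    if job.nodes == 1:
        return "single-node"
    if job.runtime_h < 4:
        return "medium (<4h)"
    return "large/long"


def median_wait_ratio(jobs):
    """Return {class: median(wait / runtime)} over the given jobs."""
    ratios = {}
    for job in jobs:
        if job.runtime_h <= 0:
            continue  # skip jobs with no measurable runtime
        ratios.setdefault(job_class(job), []).append(job.wait_h / job.runtime_h)
    return {cls: median(vals) for cls, vals in ratios.items()}


# Hypothetical sample, only to show the shape of the calculation.
sample_jobs = [
    Job(nodes=64, wait_h=12.0, runtime_h=48.0),
    Job(nodes=8, wait_h=6.0, runtime_h=2.0),
    Job(nodes=8, wait_h=9.0, runtime_h=3.0),
    Job(nodes=1, wait_h=1.0, runtime_h=10.0),
]

print(median_wait_ratio(sample_jobs))
```

We use the median rather than the mean so that a few extreme jobs do not dominate the comparison; that is a choice on our side, not a requirement.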
For this reason, I have been trying to read up on partition and priority policies, but I could not find any good documentation on the subject (https://slurm.schedmd.com/documentation.html is more focused on the technical aspects).
Does anyone know of any good references for this?
In particular, we were thinking about
(a) somehow tracking a long-term as well as a shorter-term fairshare count, i.e., lowering a user's priority if they already have something running
or
(b) introducing a "small" partition with higher priority for the <4h jobs, though we are unsure whether this would block all the larger jobs from running.
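To make idea (a) more concrete, here is a toy model of the behaviour we are after. It is not how Slurm computes fairshare, and it approximates "has something running already" by very recent usage. The function names, the two half-lives, and the 50/50 mixing weight are all assumptions chosen for illustration.

```python
def decayed_usage(events, now_h, half_life_h):
    """Sum node-hours, each discounted by 0.5 ** (age / half_life)."""
    return sum(
        node_hours * 0.5 ** ((now_h - t_h) / half_life_h)
        for t_h, node_hours in events
    )


def usage_penalty(events, now_h, long_half_life_h=60 * 24.0,
                  short_half_life_h=6.0, short_weight=0.5):
    """Blend a long-term and a short-term decayed usage into one number.

    A larger value would translate into a lower priority for that user.
    """
    long_term = decayed_usage(events, now_h, long_half_life_h)
    short_term = decayed_usage(events, now_h, short_half_life_h)
    return (1.0 - short_weight) * long_term + short_weight * short_term


now_h = 720.0  # hypothetical "current time", in hours
user_a = [(0.0, 100.0)]    # 100 node-hours used 30 days ago
user_b = [(719.0, 100.0)]  # 100 node-hours used 1 hour ago

print(usage_penalty(user_a, now_h), usage_penalty(user_b, now_h))
```

With only the long half-life, the two users would look fairly similar; the short-term term is what pushes user_b's penalty clearly above user_a's.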
Any help or reading material is highly appreciated.
Many thanks,
Max