Hello! The workloads on our current cluster are quite diverse: a number of large, long (24-72h) jobs, medium-sized jobs under 4h, and many small single-node jobs. Priority is currently based on fair share only (averaged over a timescale of a few months). For the new setup we would like to (1) discourage single-node jobs, especially the long ones, and (2) ensure fair use of the machine across the different job types. In particular, we found that the ratio of waiting time to runtime is very large for the ~1-4h jobs.
For this reason, I have been trying to read up on strategies for partition and priority policies, but I could not find any good documentation on this (https://slurm.schedmd.com/documentation.html is more focused on the technical aspects). Does anyone know of any good references?
In particular, we were thinking about (a) somehow combining a long-term and a shorter-term fairshare count, i.e., lowering a user's priority if they already have something running, or (b) introducing a "small" partition with higher priority for the <4h jobs, but we are unsure whether this would block all the larger jobs from running...
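To make idea (a) a bit more concrete, here is a rough sketch of the kind of combined score we have in mind. It is only a toy model in Python, not Slurm's actual priority formula or any Slurm API; the names (`Usage`, `toy_priority`, `w_long`, `w_short`) and the simple linear weighting are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical per-user usage record (not a Slurm data structure)."""
    long_term_share: float  # fraction of the machine used over the past months, 0..1
    running_nodes: int      # nodes the user is occupying right now


def toy_priority(usage: Usage, total_nodes: int,
                 w_long: float = 1.0, w_short: float = 1.0) -> float:
    """Toy score: higher means the user's next job should start sooner.

    Assumption: both terms are simply 1 minus a usage fraction,
    combined with hypothetical weights w_long and w_short.
    """
    long_term = 1.0 - usage.long_term_share               # long-term fairshare term
    short_term = 1.0 - usage.running_nodes / total_nodes  # penalty for currently running work
    return w_long * long_term + w_short * short_term


# Two users with the same long-term usage; only one has jobs running now.
idle_user = Usage(long_term_share=0.2, running_nodes=0)
busy_user = Usage(long_term_share=0.2, running_nodes=50)

print(toy_priority(idle_user, total_nodes=100))
print(toy_priority(busy_user, total_nodes=100))
```

With equal history, the user who currently has nothing running gets the higher score, which is the behaviour we are after; whether and how this can be expressed in an actual Slurm configuration is exactly what we are unsure about.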
Any help or reading material is highly appreciated.
Many thanks, Max