[slurm-users] Areas for improvement on our site's cluster scheduling

Mon May 7 22:41:19 MDT 2018

We have two main issues with our scheduling policy right now. The first is an issue that we call "queue stuffing." The second is an issue with interactive job availability. We aren't confused about why these issues exist, but we aren't sure of the best way to address them.

I'd love to hear any suggestions on how other sites address these issues. Thanks for any advice!

## Queue stuffing

We use multifactor scheduling to provide account-based fairshare scheduling as well as standard FIFO-style job aging. In general, this works pretty well, and accounts meet their scheduling targets; however, every now and again, we have a user with a relatively high-throughput (not HPC) workload that they're willing to wait a significant period of time for. The jobs are low-priority work, but the user puts a few thousand of them into the queue and just sits and waits. Eventually the job aging makes the jobs so high-priority, compared to the fairshare, that they all _as a set_ become higher-priority than the rest of the work on the cluster. Since they continue to age as the other jobs continue to age, they never lose that lead, and they end up monopolizing the cluster for days at a time as their high volume of relatively small jobs uses up a greater and greater percentage of the machine.
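To make the effect concrete, here is a minimal sketch of the age-versus-fairshare trade-off. It assumes a simplified two-term priority (a weighted age factor that grows linearly up to a maximum age, plus a weighted fairshare factor); the weights, the 14-day maximum age, and the fairshare values are made-up illustrative numbers, not our site's configuration.

```python
# Simplified two-term model of multifactor priority.
# All numbers below are hypothetical, chosen only for illustration.
WEIGHT_AGE = 1000        # assumed weight on the age factor
WEIGHT_FAIRSHARE = 1000  # assumed weight on the fairshare factor
MAX_AGE_DAYS = 14        # assumed wait time at which the age factor saturates


def age_factor(wait_days):
    """Age factor in [0, 1], growing linearly until MAX_AGE_DAYS."""
    return min(wait_days / MAX_AGE_DAYS, 1.0)


def priority(wait_days, fairshare):
    """Weighted sum of the age and fairshare factors."""
    return WEIGHT_AGE * age_factor(wait_days) + WEIGHT_FAIRSHARE * fairshare


def days_until_outranked(stuffed_fairshare, fresh_fairshare):
    """First whole day on which a waiting low-fairshare job outranks a
    freshly submitted job (zero wait) from a better-served account.
    Returns None if that never happens within MAX_AGE_DAYS."""
    fresh = priority(0, fresh_fairshare)
    for day in range(MAX_AGE_DAYS + 1):
        if priority(day, stuffed_fairshare) > fresh:
            return day
    return None


if __name__ == "__main__":
    # A heavy user (fairshare 0.1) versus a new job from an account
    # with fairshare 0.5.
    print(days_until_outranked(0.1, 0.5))
```

In this toy model, once the stuffed jobs have waited long enough, every one of them outranks any newly submitted job, no matter how under-served the submitting account is.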

In Moab I'd address this by limiting the number of jobs the user could have *eligible* at any given time; but it appears that the only option in Slurm is limiting the number of jobs a user can *submit*, which isn't as nice a user experience and can lead to some pathological user behaviors (like users running cron jobs that wake repeatedly and submit more jobs automatically).
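For clarity, the behavior I'm after looks roughly like the sketch below. This is not Moab or Slurm code; `select_eligible` and its arguments are hypothetical names used only to illustrate the policy: every submission is accepted, but only the first few pending jobs per user are considered eligible at a time.

```python
from collections import defaultdict


def select_eligible(pending, max_eligible_per_user):
    """Return the job IDs that would be eligible under a per-user cap.

    pending: list of (job_id, user) tuples in submission order.
    Jobs beyond the cap stay queued but are not yet eligible; they
    become eligible as the user's earlier jobs leave the queue.
    """
    eligible_count = defaultdict(int)
    eligible = []
    for job_id, user in pending:
        if eligible_count[user] < max_eligible_per_user:
            eligible_count[user] += 1
            eligible.append(job_id)
    return eligible


if __name__ == "__main__":
    # Hypothetical queue: "alice" has stuffed the queue, "bob" has one job.
    queue = [(1, "alice"), (2, "alice"), (3, "bob"), (4, "alice")]
    print(select_eligible(queue, max_eligible_per_user=2))
```

The user can still submit thousands of jobs in one go, so there's no incentive to script around a submit limit.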

## Interactive job availability

I'm becoming increasingly convinced that holding some portion of our resources aside as dedicated for relatively short, small, interactive jobs is a unique good; but I'm not sure how best to implement it. My immediate thought was to use a reservation with the DAILY and REPLACE flags. I particularly like the idea of using the REPLACE flag here, as we could keep a flexible amount of resources available irrespective of how much was actually being used for the purpose at any given time. However, it doesn't appear that there's any way to limit the per-user use of resources *within* a reservation, so if we created such a reservation and granted all users access to it, any individual user would be capable of consuming all resources in the reservation anyway. I'd have a dedicated "interactive" QOS or similar to put such restrictions on, but there doesn't appear to be a way to then limit the use of the reservation to only jobs with that QOS (aside from job_submit scripts or similar; please correct me if I'm wrong).

In lieu of that, I'm leaning towards having a dedicated interactive partition that we'd manually move some resources to; but that's a bit less flexible.
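For reference, the job_submit-style check mentioned above would amount to something like the following. This is only a sketch of the decision logic in Python, not the actual job_submit plugin interface; the function name, the reservation name, and the QOS name are all hypothetical.

```python
def check_reservation_request(reservation, qos,
                              guarded_reservation="interactive",
                              required_qos="interactive"):
    """Decide whether a submission may use the guarded reservation.

    reservation: name of the reservation the job requested, or None.
    qos: name of the QOS the job requested.
    Returns (accepted, reason).
    """
    if reservation == guarded_reservation and qos != required_qos:
        return False, (
            f"reservation '{guarded_reservation}' requires "
            f"qos '{required_qos}'"
        )
    return True, "ok"


if __name__ == "__main__":
    print(check_reservation_request("interactive", "normal"))
    print(check_reservation_request("interactive", "interactive"))
    print(check_reservation_request(None, "normal"))
```

Per-user limits would then live on the "interactive" QOS, with the check only ensuring that the reservation can't be used without it.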