[slurm-users] Features request

Thu Sep 24 22:04:28 UTC 2020

Hello all,

We're mostly a GPU compute shop, and we've been happy with Slurm for the
last three years, but we think Slurm would benefit from the following
two features:

1. Allow preemption within the same QOS, all else being equal, based on
job priority.

2. Have the job size calculation take into account the number of GPUs
allocated to the job. In a GPU cluster, the most valuable currency is
the GPU, not the CPU. Perhaps even parameterize the job size so the user
could choose what to emphasize in the calculation: CPU, GPU, or memory.

If this is not the right place to ask for this, I would appreciate a 
pointer in the right direction.

Justification:

It's pretty obvious why we'd like #2.
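
To make #2 a bit more concrete, here is a minimal Python sketch of a
parameterized job size. This is only an illustration of the idea: the
names (ClusterTotals, job_size, the weight keys) are made up for this
example and are not Slurm configuration options or Slurm code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTotals:
    """Total resources in the cluster (hypothetical numbers, for illustration)."""
    cpus: int
    gpus: int
    mem_gb: float


def job_size(cpus, gpus, mem_gb, totals, weights):
    """Return a job size in [0, 1] as a weighted mix of resource fractions.

    `weights` maps "cpu", "gpu" and "mem" to non-negative numbers chosen by
    the site. This is an assumed formula, not Slurm's actual calculation.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    # Fraction of each cluster resource that the job would occupy.
    fractions = {
        "cpu": cpus / totals.cpus,
        "gpu": gpus / totals.gpus,
        "mem": mem_gb / totals.mem_gb,
    }
    # Normalized weighted average, so the result stays between 0 and 1.
    return sum(weights[k] * fractions[k] for k in fractions) / total_weight


totals = ClusterTotals(cpus=1000, gpus=100, mem_gb=4000)

# A job using half the GPUs but only 1% of the CPUs and memory.
gpu_view = job_size(10, 50, 40, totals, {"cpu": 0, "gpu": 1, "mem": 0})
cpu_view = job_size(10, 50, 40, totals, {"cpu": 1, "gpu": 0, "mem": 0})
```

With GPU-only weights the job counts as half the cluster, while with
CPU-only weights the same job looks tiny, which is the mismatch we'd
like to be able to tune away.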

We want #1 because we believe it would allow for a more natural
maximization of cluster usage. A user X could grab the whole cluster
if it's free, while another user Y, arriving later, could get jobs in by
preempting some of X's jobs. We're assuming that the fairshare score of
user X will decrease as resources are consumed, so that Y's jobs will
have a higher priority. We also assume that requeue, checkpoint and
restart are employed. We also think that this would make the system
fairer in the long term, essentially time slicing usage through
preemption based on priority.
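
As a rough illustration of the behavior we have in mind for #1, the
Python sketch below picks which running jobs to preempt so that a
higher-priority pending job in the same QOS can start. It is a toy model
under our own assumptions (GPUs are the only resource, priorities are
plain integers, the lowest-priority jobs are preempted first); the Job
class and pick_victims function are hypothetical and do not reflect how
Slurm's scheduler is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Toy job record (hypothetical fields, not Slurm's data model)."""
    job_id: int
    user: str
    qos: str
    priority: int
    gpus: int
    requeueable: bool = True


def pick_victims(pending, running, free_gpus):
    """Choose running jobs to preempt so that `pending` can start.

    Returns [] if the job already fits, a list of victims if preemption
    can free enough GPUs, or None if it cannot.
    """
    needed = pending.gpus - free_gpus
    if needed <= 0:
        return []

    # Only requeueable jobs in the same QOS with strictly lower priority
    # are eligible; the lowest-priority jobs are preempted first.
    candidates = sorted(
        (j for j in running
         if j.qos == pending.qos
         and j.requeueable
         and j.priority < pending.priority),
        key=lambda j: j.priority,
    )

    victims, freed = [], 0
    for job in candidates:
        victims.append(job)
        freed += job.gpus
        if freed >= needed:
            return victims
    return None  # not enough lower-priority work to preempt


# User X filled the cluster earlier; X's fairshare has since dropped.
running = [
    Job(1, "X", "normal", priority=100, gpus=4),
    Job(2, "X", "normal", priority=200, gpus=4),
    Job(3, "X", "normal", priority=900, gpus=4),
]
# User Y arrives later with a higher-priority job in the same QOS.
pending = Job(4, "Y", "normal", priority=500, gpus=6)
victims = pick_victims(pending, running, free_gpus=0)
```

Here jobs 1 and 2 would be requeued to make room for Y's job, while job
3 keeps running because its priority is higher than Y's.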

Relu