[slurm-users] Slurm GPU vs CPU via partition, fair share and/or QOS help

Jodie H. Sprouse jhs43 at cornell.edu
Mon Jul 20 14:13:43 UTC 2020


Good morning. 
I’m wondering if someone could point me in the right direction to fulfill a request on one of our small clusters.

Cluster info:
 * 5 nodes, each with 4 GPUs and 28 CPUs.
 * One user (User 1) will submit only to CPUs; the other 8 users will submit to GPUs.
 * Only one account in the database, with 9 users.
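
In slurm.conf terms, I’m picturing the hardware laid out roughly like this (node names are placeholders for our real ones):

  GresTypes=gpu
  NodeName=node[01-05] CPUs=28 Gres=gpu:4 State=UNKNOWN
  # and, if split into two partitions over the same five nodes:
  PartitionName=cpu Nodes=node[01-05] Default=YES MaxTime=INFINITE State=UP
  PartitionName=gpu Nodes=node[01-05] MaxTime=INFINITE State=UP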

* All users should be able to use all of the CPUs or GPUs in the cluster at once *if* the queue is empty (i.e., max jobs per user: 20 on GPUs / 120 on CPUs).
* If there is a wait queue, the max jobs per user should be set to 10 for GPU requests and 60 for CPU requests.
* The owner does NOT want limits on how many processors/GPUs a user can use at a time.
* A user who has 10 jobs running and 100 in the wait queue should only have another of their own jobs start once one of their 10 has ended (i.e., if one of their 10 jobs ends, another user's queued job does not begin instead).

* My dilemma is the requirement that max jobs per user be "20 on GPUs / 120 on CPUs" if the queue is empty, but "10 on GPUs / 60 on CPUs" if the queue is not empty.


I’ve gone round and round on which path I should go down:

  * Separate partitions (one for GPUs, one for CPUs), with limits set per partition in slurm.conf
  * A QOS for the max-job limits (a QOS scheme such as low/normal/high is not necessary; I would only want the QOS for the max limits?) (rough sketch below)
  * One partition, with fair share strictly handling this via something like TRESBillingWeights="CPU=1.0,Mem=0.25G,gres/gpu=20" (rough sketch below)
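
To make the QOS and fair-share ideas concrete, here is roughly what I have in mind (the QOS names and weights are just placeholders, and this only covers the "wait queue is not empty" limits of 10/60):

  # QOS route: per-user running-job limits carried by a QOS attached to each partition
  sacctmgr add qos gpujobs
  sacctmgr modify qos gpujobs set MaxJobsPerUser=10
  sacctmgr add qos cpujobs
  sacctmgr modify qos cpujobs set MaxJobsPerUser=60

  # slurm.conf: enforce the limits and tie each QOS to its partition
  AccountingStorageEnforce=limits,qos
  PartitionName=gpu Nodes=node[01-05] QOS=gpujobs MaxTime=INFINITE State=UP
  PartitionName=cpu Nodes=node[01-05] QOS=cpujobs Default=YES MaxTime=INFINITE State=UP

  # fair-share route: one partition, with GPU usage billed heavily
  AccountingStorageTRES=gres/gpu
  PriorityType=priority/multifactor
  PriorityWeightFairshare=10000
  PartitionName=all Nodes=node[01-05] Default=YES MaxTime=INFINITE State=UP TRESBillingWeights="CPU=1.0,Mem=0.25G,gres/gpu=20"

Neither sketch handles switching between the 20/120 and 10/60 limits based on whether the queue is empty, which is the part I can't figure out.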



Any advice on which path(s) to go down to get to a solution would be greatly appreciated!!!
Jodie


Jodie Sprouse
Systems Administrator
Cornell University 
Center for Advanced Computing
Ithaca, NY 14850
jhs43 at cornell.edu

