[slurm-users] Advice on managing GPU cards using SLURM
D.J.Baker at soton.ac.uk
Mon Mar 5 08:14:32 MST 2018
I'm sure that this question has been asked before. We have recently added some GPU nodes to our SLURM cluster.
There are 10 nodes each providing 2 * Tesla V100-PCIE-16GB cards
There are 10 nodes each providing 4 * GeForce GTX 1080 Ti cards
I'm aware that the simplest way to manage these resources is probably to set up one or two partitions, so that users have exclusive access to each of these nodes.
Alternatively, I suspect it's possible to manage all these nodes in a single partition and additionally to allow users to submit multiple jobs to a node (for example, when each job needs just one GPU card). In that case I would guess we would need to provide a gres.conf on each of the GPU nodes, and additionally enable users to select the card type via the SLURM "feature" option. Presumably gres.conf could be configured to specify the number and type of GPU cards on each node, so that users could then request the number/type of GPUs without the feature option.
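For what it's worth, a minimal sketch of that single-partition setup might look like the following (node names, partition name, and device paths are assumptions for illustration; typed GRES makes the feature option unnecessary):

```shell
# slurm.conf (excerpt)
GresTypes=gpu
NodeName=gpu[01-10] Gres=gpu:v100:2
NodeName=gpu[11-20] Gres=gpu:gtx1080ti:4
PartitionName=gpu Nodes=gpu[01-20] Default=NO

# gres.conf on each V100 node
Name=gpu Type=v100 File=/dev/nvidia[0-1]

# gres.conf on each GTX 1080 Ti node
Name=gpu Type=gtx1080ti File=/dev/nvidia[0-3]
```

With typed GRES defined this way, a user can request a specific card type and count directly, e.g. "sbatch --gres=gpu:v100:1 job.sh", and jobs from different users can share a node if it is not configured as exclusive.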
Also, I suspect we will not want the default QOS to apply to the GPU nodes. I'm not sure whether there is a clever way to specify certain user limits in the partition definition rather than defining another QOS.
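Two possible approaches come to mind here, sketched below (the QOS name "gpuqos" and the specific limits are assumptions): attach a partition QOS whose limits override the user's default QOS, or put simple limits directly on the partition definition.

```shell
# Option 1: a dedicated QOS attached to the GPU partition
sacctmgr add qos gpuqos MaxTRESPerUser=gres/gpu=4
# slurm.conf:
PartitionName=gpu Nodes=gpu[01-20] QOS=gpuqos

# Option 2: limits set directly in the partition definition
PartitionName=gpu Nodes=gpu[01-20] MaxNodes=2 MaxTime=2-00:00:00
```

Note that the partition definition itself only supports coarse limits such as MaxNodes and MaxTime; per-user TRES limits like a GPU cap generally require a QOS (or association limits), so Option 1 is likely needed if the goal is limiting GPUs per user.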
Any help or tips on getting the configuration started -- so that the user interface is not too complex -- would be really appreciated, please.