[slurm-users] Advice on managing GPU cards using SLURM

Mon Mar 5 09:14:30 MST 2018

Hello,

I'm sure that this question has been asked before. We have recently added
some GPU nodes to our SLURM cluster.

- 10 nodes, each providing 2 * Tesla V100-PCIE-16GB cards
- 10 nodes, each providing 4 * GeForce GTX 1080 Ti cards

I'm aware that the simplest way to manage these resources is probably to
set up one or two partitions and give users exclusive access to each of
these nodes.

Alternatively, I suspect it's possible to manage all of these nodes in a
single partition and to allow users to run multiple jobs on a node at once
(say, when a job needs just one GPU card). I would guess that we would then
have to provide a gres.conf on each of the GPU nodes. One option would be to
let users pick the card type with the SLURM "feature" option (--constraint).
Presumably, though, the gres.conf file could be configured to specify the
number and type of GPU cards on each node, so that users could request the
number and type of GPUs directly, without the feature option.
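
To make the idea concrete, here is a minimal, untested sketch of what such a
setup might look like. The node names (gpuv[01-10], gpug[01-10]), the
partition name (gpu), the type labels (v100, gtx1080ti) and the device paths
are all assumptions made up for illustration; CPU, memory and other node
parameters are left out.

```
# slurm.conf (fragment). All names here are hypothetical.
GresTypes=gpu
# Sharing a node between jobs needs a consumable-resource select plugin,
# e.g. SelectType=select/cons_res with SelectTypeParameters=CR_Core.
NodeName=gpuv[01-10] Gres=gpu:v100:2
NodeName=gpug[01-10] Gres=gpu:gtx1080ti:4
PartitionName=gpu Nodes=gpuv[01-10],gpug[01-10]

# gres.conf (fragment). Device paths are assumed, not checked.
NodeName=gpuv[01-10] Name=gpu Type=v100 File=/dev/nvidia[0-1]
NodeName=gpug[01-10] Name=gpu Type=gtx1080ti File=/dev/nvidia[0-3]
```

With something like this in place, a user wanting a single V100 would
presumably submit with "sbatch -p gpu --gres=gpu:v100:1 job.sh", or with
"--gres=gpu:1" if the card type does not matter.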

Also, I suspect we will not want the default QOS to apply to the GPU nodes.
I'm not sure whether there is a clever way to specify certain user limits in
the partition definition rather than defining another QOS.
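
For reference, a rough and untested sketch of the two approaches as I
understand them. The partition name (gpu), node names, QOS name (gpu_limits)
and all limit values are hypothetical. As far as I can tell, the limits that
can be set directly on a partition apply per job, so per-user limits would
still seem to need a QOS, though one attached to the partition rather than
requested by users.

```
# Option 1: per-job limits directly in the partition definition (slurm.conf).
PartitionName=gpu Nodes=gpuv[01-10],gpug[01-10] MaxTime=2-00:00:00 MaxNodes=2

# Option 2: a partition QOS carrying per-user limits.
# slurm.conf:
#   AccountingStorageTRES=gres/gpu     (track GPUs as a TRES)
#   AccountingStorageEnforce=limits    (QOS limits are enforced only with this)
#   PartitionName=gpu Nodes=gpuv[01-10],gpug[01-10] QOS=gpu_limits
# Then create the QOS and set a hypothetical cap of 4 GPUs per user:
sacctmgr add qos gpu_limits
sacctmgr modify qos gpu_limits set MaxTRESPerUser=gres/gpu=4
```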

Any help or tips on getting the configuration started -- so that the user
interface is not too complex -- would be really appreciated, please.

Best regards,

David