[slurm-users] A Slurm topological scheduling question

Tue Dec 7 16:05:19 UTC 2021

Hello,

We have recently enabled topology-aware scheduling on our Slurm cluster. One part of the cluster consists of two racks of AMD compute nodes, which Slurm now treats as separate entities. Soon we may add another set of AMD nodes whose CPU specs differ slightly from those of the existing nodes. We'll aim to balance the new nodes across the two racks to even out the cooling/heating requirements. The new nodes will be controlled by a new partition.

Does anyone know whether it is possible to treat the two racks as a single entity (by connecting the InfiniBand switches together), and so schedule jobs across the resources in both racks with no loss of efficiency? I would be grateful for your comments and ideas. The alternative is to put all the new nodes in a completely new rack, but that would mean purchasing some new Ethernet and IB switches. We are not happy, by the way, to have node/switch connections across racks.
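
To make the question concrete, here is a rough sketch of the kind of topology.conf I have in mind. It assumes the topology/tree plugin (TopologyPlugin=topology/tree in slurm.conf). The switch names, node names and node counts are invented for illustration and are not our real configuration.

```
# topology.conf (sketch only; all switch and node names are hypothetical)
# Assumes TopologyPlugin=topology/tree is set in slurm.conf.

# One leaf switch per rack
SwitchName=rack1 Nodes=amd[001-032]
SwitchName=rack2 Nodes=amd[033-064]

# A top-level switch joining the two leaf switches,
# which would correspond to linking the InfiniBand switches together
SwitchName=top Switches=rack[1-2]
```

As I understand it, the other way to express "a single entity" would be one SwitchName line listing all the nodes from both racks, but I am not sure which of the two better reflects the physical layout once the IB switches are linked.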

Best regards,
David