[slurm-users] A Slurm topological scheduling question

Tue Dec 7 16:29:09 UTC 2021

This should be fine, assuming you don't mind the mismatch in CPU speeds.  
Unless the codes are super sensitive to topology, things should be okay, 
as modern IB is wicked fast.

In our environment here we have a variety of different hardware types 
all networked together on the same IB fabric.  That said, we create 
partitions for the different hardware types and we don't have a queue 
that schedules across them, though we do have a backfill serial queue 
that underlies everything.  All of that, though, is scheduled via a single 
scheduler with a single topology.conf governing it all.  We also have 
all our internode IP comms going over our IB fabric and it works fine.
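
For illustration, a layout along these lines might look roughly like the 
following.  This is only a sketch, not an actual site configuration: the 
node, switch, and partition names are all made up, and it assumes the 
topology/tree plugin with one leaf IB switch per rack, joined by a 
top-level switch, with the new nodes split across both racks.

```
# slurm.conf (excerpt; hypothetical node and partition names)
TopologyPlugin=topology/tree
PartitionName=amd     Nodes=amd[001-064]
PartitionName=amdnew  Nodes=amdnew[001-016]

# topology.conf (hypothetical switch names)
# One leaf switch per rack; new nodes are balanced across the racks.
SwitchName=rack1 Nodes=amd[001-032],amdnew[001-008]
SwitchName=rack2 Nodes=amd[033-064],amdnew[009-016]
# Top-level switch that joins the two rack switches.
SwitchName=spine Switches=rack[1-2]
```

With a tree like this, Slurm still tries to keep a job on as few leaf 
switches as it can, but a job that doesn't fit in one rack can span both.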

-Paul Edmon-

On 12/7/2021 11:05 AM, David Baker wrote:
> Hello,
>
> We have now enabled topology-aware scheduling on our Slurm 
> cluster. One part of the cluster consists of two racks of AMD compute 
> nodes. These racks are now treated as separate entities by Slurm. 
> Soon, we may add another set of AMD nodes with slightly different CPU 
> specs from the existing nodes. We'll aim to balance the new nodes across 
> the racks because of cooling/heating requirements. The new nodes will be 
> controlled by a new partition.
>
> Does anyone know if it is possible to regard the two racks as a single 
> entity (by connecting the InfiniBand switches together), and so 
> schedule jobs across the resources in the racks with no loss of 
> efficiency? I would be grateful for your comments and ideas, please. 
> The alternative is to put all the new nodes in a completely new rack, 
> but that does mean that we'll have to purchase some new Ethernet and IB 
> switches. We are not happy, by the way, to have node/switch 
> connections across racks.
>
> Best regards,
> David