[slurm-users] A Slurm topological scheduling question

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Tue Dec 7 21:04:38 UTC 2021


Hi David,

The topology.conf file groups nodes into sets such that Slurm will not 
schedule parallel jobs across disjoint sets.  Even though the 
topology.conf man-page refers to network switches, it really describes a 
logical topology rather than the physical network.

You may use fake (non-existent) switch names to describe the topology. 
For example, we have a small IB sub-cluster with two IB switches defined by:

SwitchName=mell023 Switches=mell0[2-3]
SwitchName=mell02 Nodes=i[004-028]
SwitchName=mell03 Nodes=i[029-050]

If you comment out the first line (mell023), you create two disjoint 
node groups ("islands"), i[004-028] and i[029-050], and jobs won't be 
scheduled across the two groups.
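
For illustration, the same file with the top-level switch commented out 
would then read:

#SwitchName=mell023 Switches=mell0[2-3]
SwitchName=mell02 Nodes=i[004-028]
SwitchName=mell03 Nodes=i[029-050]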

Physical switches and racks are irrelevant here.  In your example, you 
could add the new AMD nodes with a fake switch name in order to create a 
new "island" of nodes.  The IB fabric subnet manager of course keeps 
track of the real fabric topology, independently of Slurm.
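
As a sketch (the switch and node names below are made up; substitute 
your own), the new AMD nodes could get their own fake leaf switch that 
is not attached to any top-level switch:

SwitchName=rack1 Nodes=amd[001-032]
SwitchName=rack2 Nodes=amd[033-064]
SwitchName=newamd Nodes=amd[065-096]

Because "newamd" shares no parent switch with rack1/rack2, Slurm treats 
amd[065-096] as a separate island, regardless of how the nodes are 
actually cabled on the IB fabric.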

BTW, let me remind you of my InfiniBand topology tool slurmibtopology.sh 
for Slurm:
   https://github.com/OleHolmNielsen/Slurm_tools/tree/master/slurmibtopology
which generates an initial topology.conf file for IB networks according 
to the physical links in the fabric.
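
A minimal usage sketch (check the tool's README for the exact options): 
run the script on a host where the InfiniBand utilities are installed 
and capture its output, for example

slurmibtopology.sh > topology.conf

then review the generated file, add any fake switches you need, and 
install it alongside slurm.conf.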

I hope this helps.

/Ole

On 07-12-2021 17:05, David Baker wrote:
> We have now enabled topology-aware scheduling on our Slurm cluster. One 
> part of the cluster consists of two racks of AMD compute nodes. These 
> racks are now treated as separate entities by Slurm. Soon, we may add 
> another set of AMD nodes with slightly different CPU specs to the 
> existing nodes. We'll aim to balance the new nodes across the racks 
> with regard to cooling/heating requirements. The new nodes will be 
> controlled by a new partition.
> 
> Does anyone know if it is possible to regard the two racks as a single 
> entity (by connecting the InfiniBand switches together), and so schedule 
> jobs across the resources in the racks with no loss of efficiency? I 
> would be grateful for your comments and ideas, please. The alternative 
> is to put all the new nodes in a completely new rack, but that does 
> mean that we'll have to purchase some new Ethernet and IB switches. We 
> are not happy, by the way, to have node/switch connections across racks.


