Hello everyone,
I'm trying to improve topology awareness on a local Slurm-managed HPC system. It uses the default hierarchical three-level topology with the tree plugin. However, it does not always confine jobs to the most tightly packed group of nodes, it seems to over-provision switches for smaller jobs, and it becomes slow or overwhelmed by jobs with a high node count. I'd like to implement something closer to a literal best-fit, but I'm having trouble understanding the interfaces I would need to hook into Slurm's topology model. I would appreciate a high-level explanation of how the tree and common topology components work, how they integrate into the higher-level scheduling logic, and what the internal topology model looks like. Pointers to relevant docs discussing this would also help.
I have read the topology guide and its developer documentation, which do note some of the caveats I mentioned. However, they only talk about providing a set of weights to the upper logic levels in the form of a node ranking. I can't see how this ranking represents the topology or how it is used. From looking at the signatures and the C code, I can tell this much:
- topology-tree consumes the topology.conf and generates a ranking of some kind that is passed to topology-common.
- topology-common consumes a ranking and uses its own gres-sched to figure out which nodes can fit a job (possibly pulling info from the gres-select-plugin to determine node capabilities).
It is then supposed to apply a best-fit algorithm to fill up vacant cluster capacity efficiently, but I can't follow this part of the code: the logic is spread across separate files that I can't connect in my head.
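To make concrete what I mean by "best-fit", here is a minimal Python sketch of the behaviour I am after. It is not Slurm code and does not reflect how the plugins work internally. The `Switch` class, `best_fit_switch`, `allocate`, and the example switch and node names are all hypothetical. It only looks at idle node counts and ignores GRES, weights, and partitions.

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Hypothetical switch in a tree topology (not a Slurm data structure)."""
    name: str
    nodes: set = field(default_factory=set)       # nodes attached directly (leaf switches)
    children: list = field(default_factory=list)  # lower-level switches

    def all_nodes(self):
        """Return every node reachable below this switch."""
        result = set(self.nodes)
        for child in self.children:
            result |= child.all_nodes()
        return result


def best_fit_switch(root, idle, needed):
    """Find the switch with the fewest idle nodes that can still hold the job.

    Ties are broken in favour of the deeper (lower-level) switch.
    """
    best = None  # (key, switch, idle nodes below that switch)
    stack = [(root, 0)]
    while stack:
        sw, depth = stack.pop()
        free = sw.all_nodes() & idle
        if len(free) < needed:
            continue  # if this switch cannot fit the job, no descendant can
        key = (len(free), -depth)
        if best is None or key < best[0]:
            best = (key, sw, free)
        stack.extend((child, depth + 1) for child in sw.children)
    return (best[1], best[2]) if best else (None, set())


def allocate(root, idle, needed):
    """Return (switch name, chosen nodes), or None if the job does not fit."""
    sw, free = best_fit_switch(root, idle, needed)
    if sw is None:
        return None
    # Simplification: take any `needed` idle nodes below the chosen switch.
    return sw.name, sorted(free)[:needed]


# Hypothetical three-level example: root -> m0, m1 -> leaf switches s0, s1, s2.
s0 = Switch("s0", nodes={"n0", "n1", "n2", "n3"})
s1 = Switch("s1", nodes={"n4", "n5", "n6", "n7"})
s2 = Switch("s2", nodes={"n8", "n9", "n10", "n11"})
m0 = Switch("m0", children=[s0, s1])
m1 = Switch("m1", children=[s2])
root = Switch("root", children=[m0, m1])

idle = {"n0", "n1", "n4", "n5", "n6", "n8", "n9", "n10", "n11"}

print(allocate(root, idle, 3))  # ('s1', ['n4', 'n5', 'n6'])
```

In this example a 3-node job lands on s1, which has exactly three idle nodes, rather than on s2, which has four. That is the kind of tight packing I would like to see. A real implementation would also have to pack within the chosen switch and respect everything else the scheduler considers.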
Thanks in advance.
Referenced docs:
https://slurm.schedmd.com/topology.html
https://hpc.rz.rptu.de/documentation/topology_plugin.html
https://github.com/SchedMD/slurm/tree/master/src/plugins/topology/common