[slurm-users] Scheduling problems with topology plugin

Phil Chiu whophilchiu at gmail.com
Sat May 14 15:39:57 UTC 2022


I have a cluster with two "islands" of nodes. All nodes on each island are
connected by InfiniBand, but there is no InfiniBand connection between the
two islands.

I have a single partition for all nodes, and am trying to use the topology
plugin to ensure that multi-node jobs are scheduled either entirely on one
island or the other.

The topology.conf file is shown below:

# backplane switches
SwitchName=c1 Nodes=c1-a[1-10],c1-b[1-10]
SwitchName=c2 Nodes=c2-a[1-10],c2-b[1-10]
SwitchName=c3 Nodes=c3-a[1-10],c3-b[1-10]
SwitchName=c4 Nodes=c4-a[1-10],c4-b[1-10]
SwitchName=c5 Nodes=c5-a[1-10],c5-b[1-10]
SwitchName=c6 Nodes=c6-a[1-10],c6-b[1-10]
SwitchName=c7 Nodes=c7-a[1-10],c7-b[1-10]

# imaginary switch connecting c1/c2 backplane switches
# dummy switches are added to make the number of "levels" the same
SwitchName=c1_dummy Switches=c1
SwitchName=c2_dummy Switches=c2
SwitchName=s0 Switches=c1_dummy,c2_dummy

# imaginary single switch connecting c3-c7
# as recommended by https://slurm.schedmd.com/topology.html
SwitchName=s1 Switches=c[3-7]

## actual switch topology
# SwitchName=s1 Switches=c[3-7]
# SwitchName=s2 Switches=c[3-7]
# SwitchName=s3 Switches=c[3-7]
# SwitchName=s4 Switches=c[3-7]
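
For reference, the switch levels implied by this file can be derived with a
short script. This is a minimal Python sketch, not Slurm's own code. It
assumes a leaf switch is level 0 and a parent is one level above its highest
child, which matches the level= values in the slurmctld.log excerpt in this
message. The helper names (expand, parse_children, level) are made up for
illustration, and expand only handles the simple hostlist forms used here.

```python
import re

# Active (uncommented) lines of the topology.conf shown above.
TOPOLOGY_CONF = """
SwitchName=c1 Nodes=c1-a[1-10],c1-b[1-10]
SwitchName=c2 Nodes=c2-a[1-10],c2-b[1-10]
SwitchName=c3 Nodes=c3-a[1-10],c3-b[1-10]
SwitchName=c4 Nodes=c4-a[1-10],c4-b[1-10]
SwitchName=c5 Nodes=c5-a[1-10],c5-b[1-10]
SwitchName=c6 Nodes=c6-a[1-10],c6-b[1-10]
SwitchName=c7 Nodes=c7-a[1-10],c7-b[1-10]
SwitchName=c1_dummy Switches=c1
SwitchName=c2_dummy Switches=c2
SwitchName=s0 Switches=c1_dummy,c2_dummy
SwitchName=s1 Switches=c[3-7]
"""


def expand(expr):
    """Expand a simple hostlist such as 'c[3-7]' or 'c1_dummy,c2_dummy'."""
    names = []
    # Split on commas that are not inside a [...] range.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                names.append(f"{prefix}{i}{suffix}")
    return names


def parse_children(conf):
    """Map each switch name to the list of switches directly below it."""
    children = {}
    for line in conf.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = dict(tok.split("=", 1) for tok in line.split())
        kids = expand(fields["Switches"]) if "Switches" in fields else []
        children[fields["SwitchName"]] = kids
    return children


def level(name, children):
    """Leaf switches are level 0; a parent sits above its highest child."""
    kids = children[name]
    return 0 if not kids else 1 + max(level(k, children) for k in kids)


children = parse_children(TOPOLOGY_CONF)
levels = {name: level(name, children) for name in children}
for name, lvl in levels.items():
    print(f"{name}: level {lvl}")
```

Under that assumption the two top switches come out at different heights:
s0 at level 2 (because of the dummy switches) and s1 at level 1.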

I am running into a strange scheduling issue. If ANY nodes are available on
the first island (s0), the topology plugin always tries to place the job
there, even when s0 does not have enough free nodes to satisfy the request.
The job then gets stuck in PD (Resources), even though it could run on the
second island (s1).

I'd be extremely thankful for any help in understanding what is happening.
Thanks!

Below is some additional debugging info.

slurmctld.log seems to show that even nodes which are currently allocated
are being considered for the job. Here is the current allocation state:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
normal*      up 5-00:00:00     76  alloc c1-a[1-10],c1-b[1-10],c2-a[1-10],c2-b[1-6,9-10],c3-a[1-5,10],c3-b[1-2],c4-a[1-10],c5-a[5-10],c5-b[1-2],c7-a[1-10],c7-b[8-9]
normal*      up 5-00:00:00     64   idle c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]
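
To make the imbalance concrete, the idle list above can be tallied per
island. This is a minimal Python sketch; the expand helper and the ISLAND
mapping are illustrative only (the mapping simply restates topology.conf:
c1 and c2 under s0, c3 through c7 under s1).

```python
import re
from collections import Counter

# NODELIST of the "idle" row in the sinfo output above.
IDLE = ("c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],"
        "c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]")

# Which top-level switch each backplane switch belongs to (from topology.conf).
ISLAND = {"c1": "s0", "c2": "s0", **{f"c{i}": "s1" for i in range(3, 8)}}


def expand(expr):
    """Expand a simple hostlist such as 'c7-b[1-7,10]' into node names."""
    names = []
    # Split on commas that are not inside a [...] range.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                names.append(f"{prefix}{i}{suffix}")
    return names


# A node such as "c2-b7" hangs off backplane switch "c2".
idle_per_island = Counter(ISLAND[node.split("-")[0]] for node in expand(IDLE))
print(dict(idle_per_island))
```

With the numbers above this gives 2 idle nodes under s0 and 62 under s1, so
a 4-node job cannot fit on the first island but easily fits on the second.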

The stuck job is 82663

JOBID PARTITION  ST       TIME  NODES NODELIST(REASON)
82663    normal  PD       0:00     10 (Priority)
82660    normal  PD       0:00      4 (Resources)
82661    normal  PD       0:00      4 (Priority)
82658    normal   R      38:29      4 c2-a[2-5]
82659    normal   R      38:29      4 c1-a[5-8]
82641    normal   R    1:26:19     10 c1-b[6-10],c2-b[2-5,9]
82640    normal   R    1:37:37     10 c1-b[4-5],c2-a[6-10],c2-b[1,6,10]
82637    normal   R    1:42:29      4 c3-a[5,10],c3-b[1-2]
82638    normal   R    1:42:29      4 c5-a[9-10],c5-b[1-2]
82635    normal   R    1:42:32      4 c3-a[1-4]
82636    normal   R    1:42:32      4 c5-a[5-8]
82342    normal   R    3:34:32     10 c7-a[1-10]
82339    normal   R    3:42:02     10 c4-a[1-10]
82337    normal   R    3:46:12     10 c1-a[1-4,9-10],c1-b[1-3],c2-a1
82662      test   R       6:33      2 c7-b[8-9]

And here is part of slurmctld.log

[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/enter
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: _topo_weight_log: Topo:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10] weight:511
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: Best nodes:c1-a[5-8],c2-b[7-8] node_cnt:6 cpu_cnt:384
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1 level=0 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2 level=0 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c3 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c4 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c5 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c6 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c7 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1_dummy level=1 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2_dummy level=1 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s0 level=2 nodes=6:c1-a[5-8],c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s1 level=1 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: common_job_test: no job_resources info for JobId=82661 rc=0
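
The observation that allocated nodes show up as candidates can be checked by
comparing the node_list at _select_nodes/enter with the idle list from
sinfo. This is a minimal Python sketch. It assumes the sinfo snapshot and
the log excerpt describe the same moment, which the output itself does not
guarantee, and the expand helper is illustrative only.

```python
import re

# node_list logged at _select_nodes/enter in the slurmctld.log excerpt.
CANDIDATES = ("c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],"
              "c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]")

# NODELIST of the "idle" row in the sinfo output.
IDLE = ("c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],"
        "c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]")


def expand(expr):
    """Expand a simple hostlist such as 'c7-b[1-7,10]' into node names."""
    names = []
    # Split on commas that are not inside a [...] range.
    for part in re.split(r",(?![^\[]*\])", expr):
        m = re.fullmatch(r"(.*?)\[([^\]]+)\](.*)", part)
        if not m:
            names.append(part)
            continue
        prefix, ranges, suffix = m.groups()
        for rng in ranges.split(","):
            lo, _, hi = rng.partition("-")
            for i in range(int(lo), int(hi or lo) + 1):
                names.append(f"{prefix}{i}{suffix}")
    return names


candidates = set(expand(CANDIDATES))
idle = set(expand(IDLE))

# Nodes the scheduler considered even though sinfo does not list them as idle.
not_idle = sorted(candidates - idle)
print(len(candidates), "candidates,", len(idle), "idle,",
      len(not_idle), "considered but not idle")
print(not_idle)
```

Every idle node is in the candidate list, but the candidate list also holds
22 nodes that sinfo reports as allocated, including c1-a[5-8], the four
nodes that end up selected at _select_nodes/choose_nodes.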