[slurm-users] Scheduling problems with topology plugin
Phil Chiu
whophilchiu at gmail.com
Sat May 14 15:39:57 UTC 2022
I have a cluster with two "islands" of nodes. All nodes on each island are
connected by InfiniBand, but there is no InfiniBand connection between the
two islands.
I have a single partition for all nodes, and am trying to use the topology
plugin to ensure that multi-node jobs are scheduled either entirely on one
island or the other.
The topology.conf file is shown below:
# backplane switches
SwitchName=c1 Nodes=c1-a[1-10],c1-b[1-10]
SwitchName=c2 Nodes=c2-a[1-10],c2-b[1-10]
SwitchName=c3 Nodes=c3-a[1-10],c3-b[1-10]
SwitchName=c4 Nodes=c4-a[1-10],c4-b[1-10]
SwitchName=c5 Nodes=c5-a[1-10],c5-b[1-10]
SwitchName=c6 Nodes=c6-a[1-10],c6-b[1-10]
SwitchName=c7 Nodes=c7-a[1-10],c7-b[1-10]
# imaginary switch connecting c1/c2 backplane switches
# dummy switches are added to make the number of "levels" the same
SwitchName=c1_dummy Switches=c1
SwitchName=c2_dummy Switches=c2
SwitchName=s0 Switches=c1_dummy,c2_dummy
# imaginary single switch connecting c3-c7
# as recommended by https://slurm.schedmd.com/topology.html
SwitchName=s1 Switches=c[3-7]
## actual switch topology
# SwitchName=s1 Switches=c[3-7]
# SwitchName=s2 Switches=c[3-7]
# SwitchName=s3 Switches=c[3-7]
# SwitchName=s4 Switches=c[3-7]
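As a sanity check on the file above, here is a minimal Python sketch that parses the active (uncommented) lines and reports which top-level switches exist and how many nodes sit under each. The helper names (`expand`, `parse`, `island_nodes`) are made up for this illustration and are not part of Slurm, and the hostlist expansion only handles the simple `prefix[a-b,c]` forms used in this file.

```python
import re

# Active lines copied from the topology.conf above (comments omitted).
TOPOLOGY = """\
SwitchName=c1 Nodes=c1-a[1-10],c1-b[1-10]
SwitchName=c2 Nodes=c2-a[1-10],c2-b[1-10]
SwitchName=c3 Nodes=c3-a[1-10],c3-b[1-10]
SwitchName=c4 Nodes=c4-a[1-10],c4-b[1-10]
SwitchName=c5 Nodes=c5-a[1-10],c5-b[1-10]
SwitchName=c6 Nodes=c6-a[1-10],c6-b[1-10]
SwitchName=c7 Nodes=c7-a[1-10],c7-b[1-10]
SwitchName=c1_dummy Switches=c1
SwitchName=c2_dummy Switches=c2
SwitchName=s0 Switches=c1_dummy,c2_dummy
SwitchName=s1 Switches=c[3-7]
"""

def split_top_level(expr):
    """Split on commas that are not inside square brackets."""
    parts, depth, cur = [], 0, ""
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(cur)
            cur = ""
        else:
            cur += ch
    if cur:
        parts.append(cur)
    return parts

def expand(expr):
    """Expand a simple hostlist (prefix[a-b,c] or a bare name) into names."""
    names = []
    for part in split_top_level(expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]*)\]", part)
        if not m:
            names.append(part)  # bare name such as c1_dummy
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            names += [f"{prefix}{i}" for i in range(int(lo), int(hi or lo) + 1)]
    return names

def parse(text):
    """Return (switch -> directly attached nodes, switch -> child switches)."""
    nodes, children = {}, {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = dict(f.split("=", 1) for f in line.split())
        name = fields["SwitchName"]
        if "Nodes" in fields:
            nodes[name] = expand(fields["Nodes"])
        if "Switches" in fields:
            children[name] = expand(fields["Switches"])
    return nodes, children

def island_nodes(switch, nodes, children):
    """Collect every node reachable below the given switch."""
    found = list(nodes.get(switch, []))
    for child in children.get(switch, []):
        found += island_nodes(child, nodes, children)
    return found

nodes, children = parse(TOPOLOGY)
non_roots = {c for kids in children.values() for c in kids}
roots = sorted((set(nodes) | set(children)) - non_roots)
islands = {r: island_nodes(r, nodes, children) for r in roots}
for r in roots:
    print(r, len(islands[r]))
```

With the file as posted, the only switches without a parent are s0 and s1, so the two islands are not joined by any common switch.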
I am running into a strange scheduling issue. If there are ANY nodes
available on the first island (s0), the topology plugin always tries to
place the job on the first island, even if that island does not have enough
free nodes for the job. As a result, jobs get stuck at PD (Resources) even
when they could run on the second island (s1).
I'd be extremely thankful for any help in understanding what is happening.
Thanks!
Below is some additional debugging info.
slurmctld.log seems to show that even nodes which are currently allocated
are being considered for the job. Here is the current allocation state:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 5-00:00:00 76 alloc c1-a[1-10],c1-b[1-10],c2-a[1-10],c2-b[1-6,9-10],c3-a[1-5,10],c3-b[1-2],c4-a[1-10],c5-a[5-10],c5-b[1-2],c7-a[1-10],c7-b[8-9]
normal* up 5-00:00:00 64 idle c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]
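To make the arithmetic explicit, the following Python sketch counts the idle nodes per island from the idle list above and checks which island could hold a job of a given size. It looks at node counts only and ignores cores, memory, priority, and everything else the scheduler weighs. The chassis-to-island mapping is taken from the topology.conf in this post; `expand` and `islands_that_fit` are hypothetical helper names, and `expand` only handles bracketed entries such as `c7-b[1-7,10]`.

```python
import re
from collections import Counter

# Idle node list copied from the sinfo output above.
IDLE = ("c2-b[7-8],c3-a[6-9],c3-b[3-10],c4-b[1-10],c5-a[1-4],"
        "c5-b[3-10],c6-a[1-10],c6-b[1-10],c7-b[1-7,10]")

# Chassis prefix -> island, following the topology.conf in this post.
ISLAND_OF = {"c1": "s0", "c2": "s0",
             "c3": "s1", "c4": "s1", "c5": "s1", "c6": "s1", "c7": "s1"}

def expand(hostlist):
    """Expand bracketed entries like c7-b[1-7,10] into individual node names."""
    names = []
    for prefix, ranges in re.findall(r"([^,\[\]]+)\[([^\]]+)\]", hostlist):
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            names += [f"{prefix}{i}" for i in range(int(lo), int(hi or lo) + 1)]
    return names

# Count idle nodes on each island.
idle = Counter(ISLAND_OF[name.split("-")[0]] for name in expand(IDLE))

def islands_that_fit(need):
    """Islands whose idle node count alone would cover a job of this size."""
    return [island for island, free in sorted(idle.items()) if free >= need]

print(dict(idle))
for need in (4, 10):  # sizes of the pending jobs in this post
    print(need, islands_that_fit(need))
```

By this count s0 has 2 idle nodes and s1 has 62, so both the 4-node and the 10-node pending jobs would fit on s1 but not on s0.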
The stuck job is 82663
JOBID PARTITION ST TIME NODES NODELIST(REASON)
82663 normal PD 0:00 10 (Priority)
82660 normal PD 0:00 4 (Resources)
82661 normal PD 0:00 4 (Priority)
82658 normal R 38:29 4 c2-a[2-5]
82659 normal R 38:29 4 c1-a[5-8]
82641 normal R 1:26:19 10 c1-b[6-10],c2-b[2-5,9]
82640 normal R 1:37:37 10 c1-b[4-5],c2-a[6-10],c2-b[1,6,10]
82637 normal R 1:42:29 4 c3-a[5,10],c3-b[1-2]
82638 normal R 1:42:29 4 c5-a[9-10],c5-b[1-2]
82635 normal R 1:42:32 4 c3-a[1-4]
82636 normal R 1:42:32 4 c5-a[5-8]
82342 normal R 3:34:32 10 c7-a[1-10]
82339 normal R 3:42:02 10 c4-a[1-10]
82337 normal R 3:46:12 10 c1-a[1-4,9-10],c1-b[1-3],c2-a1
82662 test R 6:33 2 c7-b[8-9]
And here is part of slurmctld.log
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/enter
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: _select_nodes/elim_nodes
[2022-05-13T12:43:49.441] select/cons_tres: core_array_log: node_list:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: _topo_weight_log: Topo:c1-a[5-8],c2-b[7-8],c3-a[1-10],c3-b[1-10],c4-b[1-10],c5-a[1-10],c5-b[1-10],c6-a[1-10],c6-b[1-10],c7-b[1-10] weight:511
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: Best nodes:c1-a[5-8],c2-b[7-8] node_cnt:6 cpu_cnt:384
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1 level=0 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2 level=0 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c3 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c4 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c5 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c6 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c7 level=0 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c1_dummy level=1 nodes=4:c1-a[5-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=c2_dummy level=1 nodes=2:c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s0 level=2 nodes=6:c1-a[5-8],c2-b[7-8] required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: _eval_nodes_topo: switch=s1 level=1 nodes=0:(null) required:0 speed:1
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/choose_nodes
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: _select_nodes/sync_cores
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log: node_list:c1-a[5-8]
[2022-05-13T12:43:49.442] select/cons_tres: core_array_log:
core_list:node[4]:0-63,node[5]:0-63,node[6]:0-63,node[7]:0-63,node[36]:0-63,node[37]:0-63,node[40]:0-63,node[41]:0-63,node[42]:0-63,node[43]:0-63,node[44]:0-63,node[45]:0-63,node[46]:0-63,node[47]:0-63,node[48]:0-63,node[49]:0-63,node[50]:0-63,node[51]:0-63,node[52]:0-63,node[53]:0-63,node[54]:0-63,node[55]:0-63,node[56]:0-63,node[57]:0-63,node[58]:0-63,node[59]:0-63,node[70]:0-63,node[71]:0-63,node[72]:0-63,node[73]:0-63,node[74]:0-63,node[75]:0-63,node[76]:0-63,node[77]:0-63,node[78]:0-63,node[79]:0-63,node[80]:0-63,node[81]:0-63,node[82]:0-63,node[83]:0-63,node[84]:0-63,node[85]:0-63,node[86]:0-63,node[87]:0-63,node[88]:0-63,node[89]:0-63,node[90]:0-63,node[91]:0-63,node[92]:0-63,node[93]:0-63,node[94]:0-63,node[95]:0-63,node[96]:0-63,node[97]:0-63,node[98]:0-63,node[99]:0-63,node[100]:0-63,node[101]:0-63,node[102]:0-63,node[103]:0-63,node[104]:0-63,node[105]:0-63,node[106]:0-63,node[107]:0-63,node[108]:0-63,node[109]:0-63,node[110]:0-63,node[111]:0-63,node[112]:0-63,node[113]:0-63,node[114]:0-63,node[115]:0-63,node[116]:0-63,node[117]:0-63,node[118]:0-63,node[119]:0-63,node[130]:0-63,node[131]:0-63,node[132]:0-63,node[133]:0-63,node[134]:0-63,node[135]:0-63,node[136]:0-63,node[137]:0-63,node[138]:0-63,node[139]:0-63
[2022-05-13T12:43:49.442] select/cons_tres: common_job_test: no job_resources info for JobId=82661 rc=0