<div dir="ltr"><div>Hi Patrick,</div><div><br></div><div>I have seen a similar error while configuring native X-forwarding in Slurm. It was caused by Slurm sending an IP to the compute node (as part of a message) that was not routable back to the controller host. In my case it was because the controller host was multihomed, and I had misconfigured ControlMachine= in slurm.conf to a hostname associated with the wrong network interface. If your controller host has multiple network interfaces, you might want to check that all IPs associated with the controller have routes back from the compute node.</div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <<a href="mailto:Patrick.Begou@legi.grenoble-inp.fr">Patrick.Begou@legi.grenoble-inp.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <<a href="mailto:Patrick.Begou@legi.grenoble-inp.fr">Patrick.Begou@legi.grenoble-inp.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>Hi Slurm admins and developers,</div>
<div><br>
</div>
<div>Does no one have an idea about this problem?</div>
<div><br>
</div>
<div>While investigating further this morning, I discovered that it works from the management node (a small VM running slurmctld) even though I have no home directory there (I use su from root to get an unprivileged user session). It still doesn't work from the login node, even with all firewalls disabled :-(<br>
</div>
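<div><br>
</div>
<div>Roughly, the su invocation I use looks like this (a sketch, with my own user name):<br>
</div>
<blockquote># on the management node, as root; -s forces a shell since the<br>
# account has no home directory on this VM<br>
su -s /bin/bash begou -c "salloc -n 1 srun /usr/bin/hostname"<br>
</blockquote>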
<div><br>
</div>
<div>Patrick<br>
</div>
<div><br>
</div>
<div>On 10/11/2020 at 11:54, Patrick Bégou wrote:<br>
</div>
<blockquote type="cite">
<p>Hi,</p>
<p>I'm new to Slurm (as an admin) and I need some help. I'm testing my
initial setup with:</p>
<blockquote>[begou@tenibre ~]$ <b>salloc -n 1 sh</b><br>
salloc: Granted job allocation 11<br>
sh-4.4$ <b>squeue</b><br>
JOBID PARTITION NAME USER ST TIME
NODES NODELIST(REASON)<br>
<b>11 </b> all sh begou R
0:16 1 tenibre-0-0<br>
sh-4.4$<b> srun /usr/bin/hostname</b><br>
srun: error: timeout waiting for task launch, started 0 of 1
tasks<br>
srun: Job step 11.0 aborted before step completely launched.<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
srun: error: Timed out waiting for job step to complete<br>
</blockquote>
<p>I checked the connections:<br>
</p>
<p><b>tenibre is the login node</b> (no daemon running)</p>
<blockquote>nc -v tenibre-0-0 6818 <br>
nc -v management1 6817<br>
</blockquote>
<b>management1 is the management node</b> (slurmctld running)<br>
<blockquote>nc -v tenibre-0-0 6818<br>
</blockquote>
<b>tenibre-0-0 is the first compute node</b> (slurmd running)
<blockquote>
<p>nc -v management1 6817<br>
</p>
</blockquote>
<p>All tests return "<i>Ncat: Connected...</i>"</p>
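<p>A check in the remaining direction, from the compute node back to the
login node where srun listens, might look like this (the port is
arbitrary, standing in for whatever ephemeral port srun picks):</p>
<blockquote>
<p># on the login node (tenibre): temporary listener standing in for srun<br>
nc -l 42220<br>
# on the compute node (tenibre-0-0): try to connect back<br>
nc -v tenibre 42220<br>
</p>
</blockquote>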
<p>The command "id begou" works on all nodes and I can reach my
home directory on the login node and on the compute node.</p>
<p>On the compute node slurmd.log shows:</p>
<blockquote>[2020-11-10T11:21:38.050]<b> launch task</b> <b>11.0
</b>request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220<br>
[2020-11-10T11:21:38.050] debug: Checking credential with 508
bytes of sig data<br>
[2020-11-10T11:21:38.050] _run_prolog: run job script took
usec=12<br>
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
11 ran for 0 seconds<br>
[2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: switch NONE plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
NOT_INVOKED plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Message thread started
pid = 12099<br>
[2020-11-10T11:21:38.054] debug:
task_p_slurmd_reserve_resources: 11 0<br>
[2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
loaded: checkpoint/none<br>
[2020-11-10T11:21:38.068] [11.0] Munge credential signature
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: job_container none
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
cgroup '/sys/fs/cgroup/freezer/slurm' already exists<br>
[2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
stack /etc/slurm/plugstack.conf<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)<br>
[2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
SLURM_STEP_RESV_PORTS not found in env<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets<br>
[2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
thread<br>
[2020-11-10T11:21:38.069] [11.0]<b> error: connect io: No route
to host</b><br>
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No
route to host<br>
[2020-11-10T11:21:38.069] [11.0] debug:
step_terminate_monitor_stop signaling condition<br>
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
abnormally, rc = 4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp
rc=4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg:
0/5 <b>failed to send msg type 6002: No route to host</b><br>
[2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg:
1/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg:
2/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg:
3/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg:
4/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg:
5/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.372] [11.0] debug: Message thread exited<br>
[2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread
exit<br>
[2020-11-10T11:21:40.372] [11.0] <b>done with job</b></blockquote>
<p><br>
</p>
<p>But I do not understand what this "No route to host" means.</p>
<p><br>
</p>
<p>Thanks for your help.<br>
</p>
<p>Patrick<br>
</p>
<p><br>
</p>
</blockquote>
<p><br>
</p>
</div>
</blockquote></div>