<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi slurm admins and developpers,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">no one has an idea about this problem ?</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Still investigating this morning I
discover that it works from the management node (a small VM
running slurmctld) even if I have no home directory on it (I use a
su command from root to gain unprivileged user setup). It still
doesn't run from the login node even with all firewall disabled
:-( <br>
</div>
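<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">For reference, the working test from the
  management node looks roughly like this (only a sketch: it assumes
  firewalld on the login node and "begou" as the unprivileged user):<br>
</div>
<blockquote># on management1, as root (begou has no home directory there)<br>
  su begou -c "salloc -n 1 srun /usr/bin/hostname"<br>
  <br>
  # on the login node tenibre, firewall fully disabled for the test<br>
  systemctl stop firewalld<br>
  iptables -L -n &nbsp;&nbsp;# check that no filtering rules remain<br>
</blockquote>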
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Patrick<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Le 10/11/2020 à 11:54, Patrick Bégou a
écrit :<br>
</div>
<blockquote type="cite"
cite="mid:2fbf6c99-4966-294c-03de-8eb356b0a8e3@legi.grenoble-inp.fr">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<p>Hi,</p>
<p>I'm new to Slurm (as an admin) and I need some help. I am testing
  my initial setup with:</p>
<blockquote>[begou@tenibre ~]$ <b>salloc -n 1 sh</b><br>
salloc: Granted job allocation 11<br>
sh-4.4$ <b>squeue</b><br>
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)<br>
<b>11</b> all sh begou R 0:16 1 tenibre-0-0<br>
sh-4.4$<b> srun /usr/bin/hostname</b><br>
srun: error: timeout waiting for task launch, started 0 of 1
tasks<br>
srun: Job step 11.0 aborted before step completely launched.<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
srun: error: Timed out waiting for job step to complete<br>
</blockquote>
<p>I checked the connections:<br>
</p>
<p><b>tenibre is the login node</b> (no daemon running)</p>
<blockquote>nc -v tenibre-0-0 6818 <br>
nc -v management1 6817<br>
</blockquote>
<b>management1 is the management node</b> (slurmctld running)<br>
<blockquote>nc -v tenibre-0-0 6818<br>
</blockquote>
<b>tenibre-0-0 is the first compute node</b> (slurmd running)
<blockquote>
<p>nc -v management1 6817<br>
</p>
</blockquote>
<p>All tests return "<i>Ncat: Connected...</i>"</p>
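<p>One direction I have not tested is from the compute node back to
  srun itself: as far as I understand, srun listens on ephemeral
  ports on the submitting node and slurmd/slurmstepd connects back to
  it for task I/O. A rough sketch of that check (port 60001 is only
  an example; unless SrunPortRange is set in slurm.conf, the real
  ports are random ephemeral ones):</p>
<blockquote># on the login node tenibre, open a temporary listener<br>
  nc -l 60001<br>
  # on the compute node tenibre-0-0, try to connect back<br>
  nc -v tenibre 60001<br>
</blockquote>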
<p>The command "id begou" works on all nodes and I can reach my
home directory on the login node and on the compute node.</p>
<p>On the compute node, slurmd.log shows:</p>
<blockquote>[2020-11-10T11:21:38.050]<b> launch task</b> <b>11.0
</b>request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220<br>
[2020-11-10T11:21:38.050] debug: Checking credential with 508
bytes of sig data<br>
[2020-11-10T11:21:38.050] _run_prolog: run job script took
usec=12<br>
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
11 ran for 0 seconds<br>
[2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: switch NONE plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
NOT_INVOKED plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Message thread started
pid = 12099<br>
[2020-11-10T11:21:38.054] debug:
task_p_slurmd_reserve_resources: 11 0<br>
[2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
loaded: checkpoint/none<br>
[2020-11-10T11:21:38.068] [11.0] Munge credential signature
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: job_container none
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
cgroup '/sys/fs/cgroup/freezer/slurm' already exists<br>
[2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
stack /etc/slurm/plugstack.conf<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)<br>
[2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
SLURM_STEP_RESV_PORTS not found in env<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets<br>
[2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
thread<br>
[2020-11-10T11:21:38.069] [11.0]<b> error: connect io: No route
to host</b><br>
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No
route to host<br>
[2020-11-10T11:21:38.069] [11.0] debug:
step_terminate_monitor_stop signaling condition<br>
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
abnormally, rc = 4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp
rc=4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg:
0/5 <b>failed to send msg type 6002: No route to host</b><br>
[2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg:
1/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg:
2/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg:
3/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg:
4/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg:
5/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.372] [11.0] debug: Message thread exited<br>
[2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread
exit<br>
[2020-11-10T11:21:40.372] [11.0] <b>done with job</b></blockquote>
<p><br>
</p>
<p>But I do not understand what this "No route to host" error means here.</p>
<p><br>
</p>
<p>Thanks for your help.<br>
</p>
<p>Patrick<br>
</p>
<p><br>
</p>
</blockquote>
<p><br>
</p>
</body>
</html>