[slurm-users] failed to send msg type 6002: No route to host
Sean Maxwell
stm at case.edu
Thu Nov 12 13:06:26 UTC 2020
Hi Patrick,
I have seen a similar error while configuring native X-forwarding in Slurm.
It was caused by Slurm sending an IP to the compute node (as part of a
message) that was not routable back to the controller host. In my case it
was because the controller host was multihomed, and I had misconfigured
ControlMachine= in slurm.conf to a hostname associated with the wrong
network interface. If your controller host has multiple network interfaces,
you might want to check that all IPs associated with the controller have
routes back from the compute node.
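For example, a quick sanity check run from the compute node might look like the
following (management1 is taken from your message, the 10.x address is only a
placeholder, and 172.30.1.254 is the HOST address shown in your slurmd log):

  getent hosts management1      # which IP does the controller name resolve to here?
  ip route get 172.30.1.254     # is there a route back to the host that sent the request?

If the controller is indeed multihomed, you can pin the address the compute nodes
should use in slurm.conf, along the lines of:

  ControlMachine=management1
  ControlAddr=10.1.0.1          # placeholder: the controller IP on the cluster network

and then restart slurmctld and the slurmd daemons so the change is picked up.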
-Sean
On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <
Patrick.Begou at legi.grenoble-inp.fr> wrote:
> Hi slurm admins and developers,
>
> Does no one have an idea about this problem?
>
> Still investigating this morning, I discovered that it works from the
> management node (a small VM running slurmctld) even though I have no home
> directory there (I use su from root to switch to the unprivileged user).
> It still doesn't run from the login node, even with the firewall completely
> disabled :-(
>
> Patrick
>
> On 10/11/2020 at 11:54, Patrick Bégou wrote:
>
> Hi,
>
> I'm new to slurm (as admin) and I need some help. Testing my initial setup
> with:
>
> [begou@tenibre ~]$ salloc -n 1 sh
> salloc: Granted job allocation 11
> sh-4.4$ squeue
>   JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
>      11       all    sh  begou  R  0:16     1 tenibre-0-0
> sh-4.4$ srun /usr/bin/hostname
> srun: error: timeout waiting for task launch, started 0 of 1 tasks
> srun: Job step 11.0 aborted before step completely launched.
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> I checked the connections:
>
> tenibre is the login node (no daemon running):
>
> nc -v tenibre-0-0 6818
> nc -v management1 6817
>
> management1 is the management node (slurmctld running):
>
> nc -v tenibre-0-0 6818
>
> tenibre-0-0 is the first compute node (slurmd running):
>
> nc -v management1 6817
>
> All tests return "Ncat: Connected...".
>
> The command "id begou" works on all nodes and I can reach my home
> directory on the login node and on the compute node.
>
> On the compute node slurmd.log shows:
>
> [2020-11-10T11:21:38.050] launch task 11.0 request from UID:23455
> GID:1036 HOST:172.30.1.254 PORT:42220
> [2020-11-10T11:21:38.050] debug: Checking credential with 508 bytes of
> sig data
> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11 ran for
> 0 seconds
> [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
> [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin loaded
> [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE plugin loaded
> [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE plugin loaded
> [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather NOT_INVOKED
> plugin loaded
> [2020-11-10T11:21:38.054] [11.0] debug: Message thread started pid = 12099
> [2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
> [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin loaded:
> checkpoint/none
> [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug: job_container none plugin loaded
> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
> [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate: cgroup
> '/sys/fs/cgroup/freezer/slurm' already exists
> [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin stack
> /etc/slurm/plugstack.conf
> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
> [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
> [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
> SLURM_STEP_RESV_PORTS not found in env
> [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
> [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent thread
> [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route to host
> [2020-11-10T11:21:38.069] [11.0] debug: step_terminate_monitor_stop
> signaling condition
> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting abnormally, rc
> = 4021
> [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
> [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5 failed
> to send msg type 6002: No route to host
> [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
> [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
> [2020-11-10T11:21:40.372] [11.0] done with job
>
>
> But I do not understand what this "No route to host" means.
>
>
> Thanks for your help.
>
> Patrick