[slurm-users] failed to send msg type 6002: No route to host

Brian Andrus toomuchit at gmail.com
Tue Nov 10 23:41:03 UTC 2020


This looks like it may be trying to do something using MPI.

What does your slurm.conf look like for that node?
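
For reference, a minimal node and partition definition in slurm.conf usually
looks something like the lines below. The CPU count and memory are just
placeholders (not your actual hardware), and MpiDefault=pmi2 is only a guess
based on the "mpi type = pmi2" lines in your slurmd log:

    SlurmctldHost=management1
    MpiDefault=pmi2
    NodeName=tenibre-0-0 CPUs=32 RealMemory=64000 State=UNKNOWN
    PartitionName=all Nodes=tenibre-0-0 Default=YES MaxTime=INFINITE State=UP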

Brian Andrus

On 11/10/2020 2:54 AM, Patrick Bégou wrote:
>
> Hi,
>
> I'm new to Slurm (as an admin) and I need some help. I am testing my
> initial setup with:
>
>     [begou@tenibre ~]$ salloc -n 1 sh
>     salloc: Granted job allocation 11
>     sh-4.4$ squeue
>                  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>                     11       all       sh    begou  R       0:16      1 tenibre-0-0
>     sh-4.4$ srun /usr/bin/hostname
>     srun: error: timeout waiting for task launch, started 0 of 1 tasks
>     srun: Job step 11.0 aborted before step completely launched.
>     srun: Job step aborted: Waiting up to 32 seconds for job step to
>     finish.
>     srun: error: Timed out waiting for job step to complete
>
> I checked the connections:
>
> *tenibre is the login node* (no daemon running)
>
>     nc -v tenibre-0-0 6818
>     nc -v management1 6817
>
> *management1 is the management node* (slurmctld running)
>
>     nc -v tenibre-0-0 6818
>
> *tenibre-0-0 is the first compute node* (slurmd running)
>
>     nc -v management1 6817
>
> All tests return "Ncat: Connected..."
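>
> I am not sure whether the reverse direction matters, i.e. the compute
> node opening connections back to the login node where srun runs for the
> step I/O. If it does, I suppose the test would look like the line below,
> run from tenibre-0-0, with the port number only as an illustration since
> srun seems to pick an ephemeral port unless SrunPortRange is set in
> slurm.conf:
>
>     nc -v tenibre 60001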
>
> The command "id begou" works on all nodes and I can reach my home 
> directory on the login node and on the compute node.
>
> On the compute node slurmd.log shows:
>
>     [2020-11-10T11:21:38.050] launch task 11.0 request from
>     UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>     [2020-11-10T11:21:38.050] debug:  Checking credential with 508
>     bytes of sig data
>     [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>     [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job 11
>     ran for 0 seconds
>     [2020-11-10T11:21:38.053] debug:  AcctGatherEnergy NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug:  AcctGatherProfile NONE plugin loaded
>     [2020-11-10T11:21:38.053] debug:  AcctGatherInterconnect NONE
>     plugin loaded
>     [2020-11-10T11:21:38.053] debug:  AcctGatherFilesystem NONE plugin
>     loaded
>     [2020-11-10T11:21:38.053] debug:  switch NONE plugin loaded
>     [2020-11-10T11:21:38.054] [11.0] debug:  Job accounting gather
>     NOT_INVOKED plugin loaded
>     [2020-11-10T11:21:38.054] [11.0] debug:  Message thread started
>     pid = 12099
>     [2020-11-10T11:21:38.054] debug: task_p_slurmd_reserve_resources: 11 0
>     [2020-11-10T11:21:38.068] [11.0] debug:  task NONE plugin loaded
>     [2020-11-10T11:21:38.068] [11.0] debug:  Checkpoint plugin loaded:
>     checkpoint/none
>     [2020-11-10T11:21:38.068] [11.0] Munge credential signature plugin
>     loaded
>     [2020-11-10T11:21:38.068] [11.0] debug:  job_container none plugin
>     loaded
>     [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = pmi2
>     [2020-11-10T11:21:38.068] [11.0] debug:  xcgroup_instantiate:
>     cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>     [2020-11-10T11:21:38.068] [11.0] debug:  spank: opening plugin
>     stack /etc/slurm/plugstack.conf
>     [2020-11-10T11:21:38.068] [11.0] debug:  mpi type = (null)
>     [2020-11-10T11:21:38.068] [11.0] debug:  using mpi/pmi2
>     [2020-11-10T11:21:38.068] [11.0] debug:  _setup_stepd_job_info:
>     SLURM_STEP_RESV_PORTS not found in env
>     [2020-11-10T11:21:38.068] [11.0] debug:  mpi/pmi2: setup sockets
>     [2020-11-10T11:21:38.069] [11.0] debug:  mpi/pmi2: started agent
>     thread
>     [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
>     [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route
>     to host
>     [2020-11-10T11:21:38.069] [11.0] debug:
>     step_terminate_monitor_stop signaling condition
>     [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
>     abnormally, rc = 4021
>     [2020-11-10T11:21:38.069] [11.0] debug:  Sending launch resp rc=4021
>     [2020-11-10T11:21:38.069] [11.0] debug:  _send_srun_resp_msg: 0/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.169] [11.0] debug:  _send_srun_resp_msg: 1/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.370] [11.0] debug:  _send_srun_resp_msg: 2/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:38.770] [11.0] debug:  _send_srun_resp_msg: 3/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:39.570] [11.0] debug:  _send_srun_resp_msg: 4/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:40.370] [11.0] debug:  _send_srun_resp_msg: 5/5
>     failed to send msg type 6002: No route to host
>     [2020-11-10T11:21:40.372] [11.0] debug:  Message thread exited
>     [2020-11-10T11:21:40.372] [11.0] debug:  mpi/pmi2: agent thread exit
>     [2020-11-10T11:21:40.372] [11.0] done with job
>
>
> But I do not understand what this "No route to host" means.
>
>
> Thanks for your help.
>
> Patrick
>
>