[slurm-users] [SOLVED] failed to send msg type 6002: No route to host
Patrick Bégou
Patrick.Begou at legi.grenoble-inp.fr
Mon Nov 16 13:28:00 UTC 2020
Hi,
just to give some feedback about this problem: it was a firewall problem.
When using "salloc/srun" from the login node, the login node must accept
connections from the compute nodes even though neither slurmd nor
slurmctld is running on this login node.
I had to open the firewall widely on the internal interface of the login
node, as I do not know which [range of] ports are required.
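For the record, one way to do this on a firewalld-based system is to move
the cluster-facing interface into the trusted zone (the interface name
eth1 below is only an example, adjust it to your setup):

    # put the internal/cluster-facing interface in the trusted zone
    firewall-cmd --permanent --zone=trusted --change-interface=eth1
    firewall-cmd --reload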
Maybe a short sentence about firewall requirements in the "Quick Start
Administrator Guide" would be nice for new Slurm admins (and save time).
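In the meantime, a less drastic option than opening everything seems to be
the SrunPortRange setting in slurm.conf, which pins the ports srun listens
on to a fixed range, so that only this range has to be opened on the login
node (the range and the firewalld zone below are just examples):

    # slurm.conf: restrict srun's listening ports to a known range
    SrunPortRange=60001-63000

    # then open only that range on the login node's internal interface
    firewall-cmd --permanent --zone=internal --add-port=60001-63000/tcp
    firewall-cmd --reload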
Thanks to all for trying to help me solve this strange problem.
Patrick
On 12/11/2020 at 13:37, Patrick Bégou wrote:
> Hi Slurm admins and developers,
>
> does no one have an idea about this problem?
>
> Still investigating this morning, I discovered that it works from the
> management node (a small VM running slurmctld) even though I have no
> home directory on it (I use a su command from root to get an
> unprivileged user environment). It still doesn't work from the login
> node, even with the firewall completely disabled :-(
>
> Patrick
>
> On 10/11/2020 at 11:54, Patrick Bégou wrote:
>>
>> Hi,
>>
>> I'm new to Slurm (as an admin) and I need some help. I am testing my
>> initial setup with:
>>
>> [begou@tenibre ~]$ salloc -n 1 sh
>> salloc: Granted job allocation 11
>> sh-4.4$ squeue
>>   JOBID PARTITION  NAME   USER ST  TIME NODES NODELIST(REASON)
>>      11       all    sh  begou  R  0:16     1 tenibre-0-0
>> sh-4.4$ srun /usr/bin/hostname
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 11.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> I check the connections:
>>
>> tenibre is the login node (no daemon running)
>>
>> nc -v tenibre-0-0 6818
>> nc -v management1 6817
>>
>> management1 is the management node (slurmctld running)
>>
>> nc -v tenibre-0-0 6818
>>
>> tenibre-0-0 is the first compute node (slurmd running)
>>
>> nc -v management1 6817
>>
>> All tests return "Ncat: Connected..."
>>
>> The command "id begou" works on all nodes and I can reach my home
>> directory on the login node and on the compute node.
>>
>> On the compute node slurmd.log shows:
>>
>> [2020-11-10T11:21:38.050] launch task 11.0 request from
>> UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>> [2020-11-10T11:21:38.050] debug: Checking credential with 508
>> bytes of sig data
>> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
>> 11 ran for 0 seconds
>> [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
>> loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
>> NOT_INVOKED plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Message thread started
>> pid = 12099
>> [2020-11-10T11:21:38.054] debug:
>> task_p_slurmd_reserve_resources: 11 0
>> [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
>> loaded: checkpoint/none
>> [2020-11-10T11:21:38.068] [11.0] Munge credential signature
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: job_container none
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
>> stack /etc/slurm/plugstack.conf
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
>> [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
>> SLURM_STEP_RESV_PORTS not found in env
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
>> [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
>> thread
>> [2020-11-10T11:21:38.069] [11.0] error: connect io: No route to host
>> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route
>> to host
>> [2020-11-10T11:21:38.069] [11.0] debug:
>> step_terminate_monitor_stop signaling condition
>> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
>> abnormally, rc = 4021
>> [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
>> [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
>> [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
>> [2020-11-10T11:21:40.372] [11.0] done with job
>>
>>
>> But I do not understand what this "No route to host" means.
>>
>>
>> Thanks for your help.
>>
>> Patrick
>>
>>
>