[slurm-users] failed to send msg type 6002: No route to host
Patrick Bégou
Patrick.Begou at legi.grenoble-inp.fr
Thu Nov 12 15:07:13 UTC 2020
Hi Marcus,
Thanks for your reply. I'm new to Slurm deployment and I do not
remember where I found this command to check the Slurm setup.
SallocDefaultCommand is not defined in my slurm.conf file.
What is strange to me is that it works on the node hosting slurmctld,
and on the compute node too.
On the compute node, connected as root and then using "su - begou":
[root at tenibre-0-0 ~]# *su - begou*
Last login: Tue Nov 10 20:49:45 CET 2020 on pts/0
[begou at tenibre-0-0 ~]$ *sinfo*
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
equipment_typeC up infinite 1 idle tenibre-0-0
all* up infinite 1 idle tenibre-0-0
[begou at tenibre-0-0 ~]$ *squeue*
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[begou at tenibre-0-0 ~]$ *salloc -n 1 srun hostname*
salloc: Granted job allocation 45
tenibre-0-0
salloc: Relinquishing job allocation 45
[begou at tenibre-0-0 ~]$
On the management node, connected as root and then using "su - begou"
(with no home directory available):
[root at management1 ~]# *su - begou*
Creating home directory for begou.
Last login: Thu Nov 12 12:43:47 CET 2020 on pts/1
su: warning: cannot change directory to /HA/sources/begou: No such
file or directory
[begou at management1 root]$ *sinfo*
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
equipment_typeC up infinite 1 idle tenibre-0-0
all* up infinite 1 idle tenibre-0-0
[begou at management1 root]$ *squeue*
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[begou at management1 root]$ *salloc -n 1 srun hostname*
salloc: Granted job allocation 46
slurmstepd: error: couldn't chdir to `/root': Permission denied:
going to /tmp instead
tenibre-0-0
salloc: Relinquishing job allocation 46
[begou at management1 root]$
But it does not work on the login node, where I need it....
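Just a guess on my side: the nc tests in my first message (quoted below) only
check that the login node can reach slurmd and slurmctld, not that the compute
node can connect back to the ephemeral ports that salloc/srun open on the
login node for I/O. A rough check of that reverse direction, with any free
high port, could be:
on tenibre (login node):
nc -l 34531
on tenibre-0-0 (compute node):
nc -v tenibre 34531
Here 34531 is just an example; srun picks a new stdio port for each step, as
the verbose output further down shows.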
On 12/11/2020 at 14:05, Marcus Wagner wrote:
>
> For me at least, this is running as expected.
>
> I'm not sure why you use "sh" as the command for salloc; I never saw
> that before. If you do not provide a command, the user's default shell
> will be started if "SallocDefaultCommand" is not set within
> slurm.conf.
> So, what does
> $> salloc -n 1
> $> srun hostname
*This command hangs.*
>
> and what does
> $> salloc -n 1 srun hostname
>
*This command hangs too*, from the login node.
>
> Best
> Marcus
>
>
> P.S.:
>
> Increasing debugging might also help, e.g.
>
> $> srun -vvvvv hostname
>
Yes, I tried this but wasn't able to find pertinent information. *This is
what I get*:
[begou at tenibre ~]$ *salloc -n 1 "srun -vvvvv hostname"*
salloc: Granted job allocation 43
salloc: error: _fork_command: Unable to find command "srun -vvvvv
hostname"
salloc: Relinquishing job allocation 43
[begou at tenibre ~]$ salloc -n 1 srun -vvvvv hostname
salloc: Granted job allocation 44
srun: defined options
srun: -------------------- --------------------
srun: (null) : tenibre-0-0
srun: jobid : 44
srun: job-name : srun
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 5
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=18446744073709551615
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=512946
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_MEMLOCK=65536
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=44969
srun: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
srun: debug: Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 44: nodes(1):`tenibre-0-0', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug: requesting job 44, user 23455, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 10
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug: initialized stdio listening socket, port 34531
srun: debug: Started IO server thread (139644034881280)
srun: debug: Entering _launch_tasks
srun: debug3: IO thread pid = 1733164
srun: debug4: eio: handling events for 4 objects
srun: launching 44.0 on host tenibre-0-0, 1 tasks: 0
srun: debug3: uid:23455 gid:1036 cwd:/HA/sources/begou 0
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: route default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to tenibre-0-0
srun: debug4: orig_timeout was 20000 we have 0 steps and a timeout
of 20000
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 44.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 1 fd 10
srun: debug2: false, shutdown
srun: debug3: eio_message_socket_readable: shutdown 1 fd 6
srun: debug2: false, shutdown
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, shutdown
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug2: false, shutdown
srun: debug: IO thread exiting
salloc: Relinquishing job allocation 44
[begou at tenibre ~]$
This problem looks really strange to me....
Patrick
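P.S.: "No route to host" is the EHOSTUNREACH error; besides a genuine routing
problem, a host firewall that rejects packets (firewalld answering with
icmp-host-prohibited, for example) produces the same message. If that turns
out to be the cause on the login node, a possible mitigation (only a sketch,
assuming firewalld and an otherwise unused port range) would be to pin the
ports srun listens on with SrunPortRange in slurm.conf and open just that
range on the login node:
SrunPortRange=60001-63000        (in slurm.conf, on all nodes)
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload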
> On 10.11.2020 at 11:54, Patrick Bégou wrote:
>> Hi,
>>
>> I'm new to Slurm (as admin) and I need some help. Testing my initial
>> setup with:
>>
>> [begou at tenibre ~]$ *salloc -n 1 sh*
>> salloc: Granted job allocation 11
>> sh-4.4$ *squeue*
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> *11* all sh begou R 0:16 1 tenibre-0-0
>> sh-4.4$ *srun /usr/bin/hostname*
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 11.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to
>> finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> I checked the connections:
>>
>> *tenibre is the login node* (no daemon running)
>>
>> nc -v tenibre-0-0 6818
>> nc -v management1 6817
>>
>> *management1 is the management node* (slurmctld running)
>>
>> nc -v tenibre-0-0 6818
>>
>> *tenibre-0-0 is the first compute node* (slurmd running)
>>
>> nc -v management1 6817
>>
>> All tests return "Ncat: Connected...".
>>
>> The command "id begou" works on all nodes and I can reach my home
>> directory on the login node and on the compute node.
>>
>> On the compute node, slurmd.log shows:
>>
>> [2020-11-10T11:21:38.050] *launch task 11.0* request from
>> UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>> [2020-11-10T11:21:38.050] debug: Checking credential with 508
>> bytes of sig data
>> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
>> 11 ran for 0 seconds
>> [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin
>> loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
>> loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
>> NOT_INVOKED plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Message thread started
>> pid = 12099
>> [2020-11-10T11:21:38.054] debug:
>> task_p_slurmd_reserve_resources: 11 0
>> [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
>> loaded: checkpoint/none
>> [2020-11-10T11:21:38.068] [11.0] Munge credential signature
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: job_container none
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
>> stack /etc/slurm/plugstack.conf
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
>> [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
>> SLURM_STEP_RESV_PORTS not found in env
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
>> [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
>> thread
>> [2020-11-10T11:21:38.069] [11.0] *error: connect io: No route to
>> host*
>> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route
>> to host
>> [2020-11-10T11:21:38.069] [11.0] debug:
>> step_terminate_monitor_stop signaling condition
>> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
>> abnormally, rc = 4021
>> [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
>> [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5
>> *failed to send msg type 6002: No route to host*
>> [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
>> [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
>> [2020-11-10T11:21:40.372] [11.0] *done with job*
>>
>>
>> But I do not understand what this "No route to host" means.
>>
>>
>> Thanks for your help.
>>
>> Patrick
>>
>>
>