[slurm-users] failed to send msg type 6002: No route to host
Patrick Bégou
Patrick.Begou at legi.grenoble-inp.fr
Thu Nov 12 15:07:13 UTC 2020
Hi Marcus,
Thanks for your reply. I'm new to Slurm deployment and I do not
remember where I found this command to check the Slurm setup.
SallocDefaultCommand is not defined in my slurm.conf file.
What is strange to me is that it works on the node hosting slurmctld,
and on the compute node too.
On the compute node, connected as root and then using "su - begou":
[root at tenibre-0-0 ~]# *su - begou*
Last login: Tue Nov 10 20:49:45 CET 2020 on pts/0
[begou at tenibre-0-0 ~]$ *sinfo*
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
equipment_typeC up infinite 1 idle tenibre-0-0
all* up infinite 1 idle tenibre-0-0
[begou at tenibre-0-0 ~]$ *squeue*
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[begou at tenibre-0-0 ~]$ *salloc -n 1 srun hostname*
salloc: Granted job allocation 45
tenibre-0-0
salloc: Relinquishing job allocation 45
[begou at tenibre-0-0 ~]$
On the management node, connected as root and then using "su - begou"
(with no home directory available):
[root at management1 ~]# *su - begou*
Creating home directory for begou.
Last login: Thu Nov 12 12:43:47 CET 2020 on pts/1
su: warning: cannot change directory to /HA/sources/begou: No such
file or directory
[begou at management1 root]$ *sinfo*
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
equipment_typeC up infinite 1 idle tenibre-0-0
all* up infinite 1 idle tenibre-0-0
[begou at management1 root]$ *squeue*
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
[begou at management1 root]$ *salloc -n 1 srun hostname*
salloc: Granted job allocation 46
slurmstepd: error: couldn't chdir to `/root': Permission denied:
going to /tmp instead
tenibre-0-0
salloc: Relinquishing job allocation 46
[begou at management1 root]$
But it does not work on the login node, where I need it....
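Just a guess on my side: the nc tests in my first message (quoted below) only
check that the login node can reach slurmd and slurmctld, not that the compute
node can connect back to the ephemeral ports that salloc/srun open on the
login node for I/O. A rough check of that reverse direction, with any free
high port, could be:
on tenibre (login node):
nc -l 34531
on tenibre-0-0 (compute node):
nc -v tenibre 34531
Here 34531 is just an example; srun picks a new stdio port for each step, as
the verbose output further down shows.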
On 12/11/2020 at 14:05, Marcus Wagner wrote:
>
> For me at least, this is running as expected.
>
> I'm not sure why you use "sh" as the command for salloc; I never saw
> that before. If you do not provide a command, the user's default shell
> will be started if "SallocDefaultCommand" is not set within
> slurm.conf.
> So, what does
> $> salloc -n 1
> $> srun hostname
*This command hangs.*
>
> and what does
> $> salloc -n 1 srun hostname
>
*This command hangs too*, from the login node.
>
> Best
> Marcus
>
>
> P.S.:
>
> Increasing debugging might also help, e.g.
>
> $> srun -vvvvv hostname
>
Yes, I tried this but wasn't able to find pertinent information. *This is
what I get*:
[begou at tenibre ~]$ *salloc -n 1 "srun -vvvvv hostname"*
salloc: Granted job allocation 43
salloc: error: _fork_command: Unable to find command "srun -vvvvv
hostname"
salloc: Relinquishing job allocation 43
[begou at tenibre ~]$ salloc -n 1 srun -vvvvv hostname
salloc: Granted job allocation 44
srun: defined options
srun: -------------------- --------------------
srun: (null) : tenibre-0-0
srun: jobid : 44
srun: job-name : srun
srun: nodes : 1
srun: ntasks : 1
srun: verbose : 5
srun: -------------------- --------------------
srun: end of defined options
srun: debug: propagating RLIMIT_CPU=18446744073709551615
srun: debug: propagating RLIMIT_FSIZE=18446744073709551615
srun: debug: propagating RLIMIT_DATA=18446744073709551615
srun: debug: propagating RLIMIT_STACK=8388608
srun: debug: propagating RLIMIT_CORE=18446744073709551615
srun: debug: propagating RLIMIT_RSS=18446744073709551615
srun: debug: propagating RLIMIT_NPROC=512946
srun: debug: propagating RLIMIT_NOFILE=1024
srun: debug: propagating RLIMIT_MEMLOCK=65536
srun: debug: propagating RLIMIT_AS=18446744073709551615
srun: debug: propagating SLURM_PRIO_PROCESS=0
srun: debug: propagating UMASK=0022
srun: debug2: srun PMI messages to port=44969
srun: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
srun: debug: Munge authentication plugin loaded
srun: debug3: Success.
srun: jobid 44: nodes(1):`tenibre-0-0', cpu counts: 1(x1)
srun: debug2: creating job with 1 tasks
srun: debug: requesting job 44, user 23455, nodes 1 including ((null))
srun: debug: cpus 1, tasks 1, name hostname, relative 65534
srun: CpuBindType=(null type)
srun: debug: Entering slurm_step_launch
srun: debug: mpi type = (null)
srun: debug: Using mpi/none
srun: debug: Entering _msg_thr_create()
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 0 fd 10
srun: debug3: eio_message_socket_readable: shutdown 0 fd 6
srun: debug: initialized stdio listening socket, port 34531
srun: debug: Started IO server thread (139644034881280)
srun: debug: Entering _launch_tasks
srun: debug3: IO thread pid = 1733164
srun: debug4: eio: handling events for 4 objects
srun: launching 44.0 on host tenibre-0-0, 1 tasks: 0
srun: debug3: uid:23455 gid:1036 cwd:/HA/sources/begou 0
srun: debug2: Called _file_readable
srun: debug3: false, all ioservers not yet initialized
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: Trying to load plugin /usr/lib64/slurm/route_default.so
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: route default plugin loaded
srun: debug3: Success.
srun: debug2: Tree head got back 0 looking for 1
srun: debug3: Tree sending to tenibre-0-0
srun: debug4: orig_timeout was 20000 we have 0 steps and a timeout
of 20000
srun: debug2: Tree head got back 1
srun: debug: launch returned msg_rc=0 err=0 type=8001
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: Job step 44.0 aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
srun: debug4: eio: handling events for 2 objects
srun: debug3: eio_message_socket_readable: shutdown 1 fd 10
srun: debug2: false, shutdown
srun: debug3: eio_message_socket_readable: shutdown 1 fd 6
srun: debug2: false, shutdown
srun: debug4: eio: handling events for 4 objects
srun: debug2: Called _file_readable
srun: debug3: false, shutdown
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug2: Called _file_writable
srun: debug3: false
srun: debug3: eof is false
srun: debug3: Called _listening_socket_readable
srun: debug2: false, shutdown
srun: debug: IO thread exiting
salloc: Relinquishing job allocation 44
[begou at tenibre ~]$
This problem looks really strange to me....
Patrick
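P.S.: "No route to host" is the EHOSTUNREACH error; besides a genuine routing
problem, a host firewall that rejects packets (firewalld answering with
icmp-host-prohibited, for example) produces the same message. If that turns
out to be the cause on the login node, a possible mitigation (only a sketch,
assuming firewalld and an otherwise unused port range) would be to pin the
ports srun listens on with SrunPortRange in slurm.conf and open just that
range on the login node:
SrunPortRange=60001-63000        (in slurm.conf, on all nodes)
firewall-cmd --permanent --add-port=60001-63000/tcp
firewall-cmd --reload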
> On 10.11.2020 at 11:54, Patrick Bégou wrote:
>> Hi,
>>
>> I'm new to Slurm (as admin) and I need some help. Testing my initial
>> setup with:
>>
>> [begou at tenibre ~]$ *salloc -n 1 sh*
>> salloc: Granted job allocation 11
>> sh-4.4$ *squeue*
>> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
>> *11* all sh begou R 0:16 1 tenibre-0-0
>> sh-4.4$ *srun /usr/bin/hostname*
>> srun: error: timeout waiting for task launch, started 0 of 1 tasks
>> srun: Job step 11.0 aborted before step completely launched.
>> srun: Job step aborted: Waiting up to 32 seconds for job step to
>> finish.
>> srun: error: Timed out waiting for job step to complete
>>
>> I checked the connections:
>>
>> *tenibre is the login node* (no daemon running)
>>
>> nc -v tenibre-0-0 6818
>> nc -v management1 6817
>>
>> *management1 is the management node* (slurmctld running)
>>
>> nc -v tenibre-0-0 6818
>>
>> *tenibre-0-0 is the first compute node* (slurmd running)
>>
>> nc -v management1 6817
>>
>> All tests return "Ncat: Connected...".
>>
>> The command "id begou" works on all nodes and I can reach my home
>> directory on the login node and on the compute node.
>>
>> On the compute node, slurmd.log shows:
>>
>> [2020-11-10T11:21:38.050] *launch task 11.0* request from
>> UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220
>> [2020-11-10T11:21:38.050] debug: Checking credential with 508
>> bytes of sig data
>> [2020-11-10T11:21:38.050] _run_prolog: run job script took usec=12
>> [2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
>> 11 ran for 0 seconds
>> [2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin
>> loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
>> loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
>> plugin loaded
>> [2020-11-10T11:21:38.053] debug: switch NONE plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
>> NOT_INVOKED plugin loaded
>> [2020-11-10T11:21:38.054] [11.0] debug: Message thread started
>> pid = 12099
>> [2020-11-10T11:21:38.054] debug:
>> task_p_slurmd_reserve_resources: 11 0
>> [2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
>> loaded: checkpoint/none
>> [2020-11-10T11:21:38.068] [11.0] Munge credential signature
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: job_container none
>> plugin loaded
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
>> cgroup '/sys/fs/cgroup/freezer/slurm' already exists
>> [2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
>> stack /etc/slurm/plugstack.conf
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)
>> [2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2
>> [2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
>> SLURM_STEP_RESV_PORTS not found in env
>> [2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets
>> [2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
>> thread
>> [2020-11-10T11:21:38.069] [11.0] *error: connect io: No route to
>> host*
>> [2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No route
>> to host
>> [2020-11-10T11:21:38.069] [11.0] debug:
>> step_terminate_monitor_stop signaling condition
>> [2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
>> abnormally, rc = 4021
>> [2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp rc=4021
>> [2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg: 0/5
>> *failed to send msg type 6002: No route to host*
>> [2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg: 1/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg: 2/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg: 3/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg: 4/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg: 5/5
>> failed to send msg type 6002: No route to host
>> [2020-11-10T11:21:40.372] [11.0] debug: Message thread exited
>> [2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread exit
>> [2020-11-10T11:21:40.372] [11.0] *done with job*
>>
>>
>> But I do not understand what this "No route to host" means.
>>
>>
>> Thanks for your help.
>>
>> Patrick
>>
>>
>