<div dir="ltr"><div>Hi Patrick,</div><div><br></div><div>I have seen a similar error while configuring native X-forwarding in Slurm. It was caused by Slurm sending an IP to the compute node (as part of a message) that was not routable back to the controller host. In my case it was because the controller host was multihomed, and I had misconfigured ControlMachine= in slurm.conf to a hostname associated with the wrong network interface. If your controller host has multiple network interfaces, you might want to check that all IPs associated with the controller have routes back from the compute node.</div><div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <<a href="mailto:Patrick.Begou@legi.grenoble-inp.fr">Patrick.Begou@legi.grenoble-inp.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div><br></div><div>-Sean<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 12, 2020 at 7:40 AM Patrick Bégou <<a href="mailto:Patrick.Begou@legi.grenoble-inp.fr">Patrick.Begou@legi.grenoble-inp.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<div>Hi Slurm admins and developers,</div>
<div><br>
</div>
<div>Does no one have an idea about this problem?</div>
<div><br>
</div>
<div>While investigating further this morning, I discovered that it works from the management node (a small VM running slurmctld) even though I have no home directory there (I use su from root to get an unprivileged user session). It still doesn't work from the login node, even with all firewalls disabled :-(<br>
</div>
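<div><br>
</div>
<div>Roughly, the su invocation I use looks like this (a sketch, with my own user name):<br>
</div>
<blockquote># on the management node, as root; -s forces a shell since the<br>
# account has no home directory on this VM<br>
su -s /bin/bash begou -c "salloc -n 1 srun /usr/bin/hostname"<br>
</blockquote>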
<div><br>
</div>
<div>Patrick<br>
</div>
<div><br>
</div>
<div>On 10/11/2020 at 11:54, Patrick Bégou wrote:<br>
</div>
<blockquote type="cite">
<p>Hi,</p>
<p>I'm new to Slurm (as an admin) and I need some help. I'm testing my
initial setup with:</p>
<blockquote>[begou@tenibre ~]$ <b>salloc -n 1 sh</b><br>
salloc: Granted job allocation 11<br>
sh-4.4$ <b>squeue</b><br>
JOBID PARTITION NAME USER ST TIME
NODES NODELIST(REASON)<br>
<b>11 </b> all sh begou R
0:16 1 tenibre-0-0<br>
sh-4.4$<b> srun /usr/bin/hostname</b><br>
srun: error: timeout waiting for task launch, started 0 of 1
tasks<br>
srun: Job step 11.0 aborted before step completely launched.<br>
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.<br>
srun: error: Timed out waiting for job step to complete<br>
</blockquote>
<p>I checked the connections:<br>
</p>
<p><b>tenibre is the login node</b> (no daemon running)</p>
<blockquote>nc -v tenibre-0-0 6818 <br>
nc -v management1 6817<br>
</blockquote>
<b>management1 is the management node</b> (slurmctld running)<br>
<blockquote>nc -v tenibre-0-0 6818<br>
</blockquote>
<b>tenibre-0-0 is the first compute node</b> (slurmd running)
<blockquote>
<p>nc -v management1 6817<br>
</p>
</blockquote>
<p>All tests return "<i>Ncat: Connected...</i>"</p>
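<p>A check in the remaining direction, from the compute node back to the
login node where srun listens, might look like this (the port is
arbitrary, standing in for whatever ephemeral port srun picks):</p>
<blockquote>
<p># on the login node (tenibre): temporary listener standing in for srun<br>
nc -l 42220<br>
# on the compute node (tenibre-0-0): try to connect back<br>
nc -v tenibre 42220<br>
</p>
</blockquote>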
<p>The command "id begou" works on all nodes and I can reach my
home directory on the login node and on the compute node.</p>
<p>On the compute node slurmd.log shows:</p>
<blockquote>[2020-11-10T11:21:38.050]<b> launch task</b> <b>11.0
</b>request from UID:23455 GID:1036 HOST:172.30.1.254 PORT:42220<br>
[2020-11-10T11:21:38.050] debug: Checking credential with 508
bytes of sig data<br>
[2020-11-10T11:21:38.050] _run_prolog: run job script took
usec=12<br>
[2020-11-10T11:21:38.050] _run_prolog: prolog with lock for job
11 ran for 0 seconds<br>
[2020-11-10T11:21:38.053] debug: AcctGatherEnergy NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherProfile NONE plugin
loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherInterconnect NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: AcctGatherFilesystem NONE
plugin loaded<br>
[2020-11-10T11:21:38.053] debug: switch NONE plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Job accounting gather
NOT_INVOKED plugin loaded<br>
[2020-11-10T11:21:38.054] [11.0] debug: Message thread started
pid = 12099<br>
[2020-11-10T11:21:38.054] debug:
task_p_slurmd_reserve_resources: 11 0<br>
[2020-11-10T11:21:38.068] [11.0] debug: task NONE plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: Checkpoint plugin
loaded: checkpoint/none<br>
[2020-11-10T11:21:38.068] [11.0] Munge credential signature
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: job_container none
plugin loaded<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: xcgroup_instantiate:
cgroup '/sys/fs/cgroup/freezer/slurm' already exists<br>
[2020-11-10T11:21:38.068] [11.0] debug: spank: opening plugin
stack /etc/slurm/plugstack.conf<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi type = (null)<br>
[2020-11-10T11:21:38.068] [11.0] debug: using mpi/pmi2<br>
[2020-11-10T11:21:38.068] [11.0] debug: _setup_stepd_job_info:
SLURM_STEP_RESV_PORTS not found in env<br>
[2020-11-10T11:21:38.068] [11.0] debug: mpi/pmi2: setup sockets<br>
[2020-11-10T11:21:38.069] [11.0] debug: mpi/pmi2: started agent
thread<br>
[2020-11-10T11:21:38.069] [11.0]<b> error: connect io: No route
to host</b><br>
[2020-11-10T11:21:38.069] [11.0] error: IO setup failed: No
route to host<br>
[2020-11-10T11:21:38.069] [11.0] debug:
step_terminate_monitor_stop signaling condition<br>
[2020-11-10T11:21:38.069] [11.0] error: job_manager exiting
abnormally, rc = 4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: Sending launch resp
rc=4021<br>
[2020-11-10T11:21:38.069] [11.0] debug: _send_srun_resp_msg:
0/5 <b>failed to send msg type 6002: No route to host</b><br>
[2020-11-10T11:21:38.169] [11.0] debug: _send_srun_resp_msg:
1/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.370] [11.0] debug: _send_srun_resp_msg:
2/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:38.770] [11.0] debug: _send_srun_resp_msg:
3/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:39.570] [11.0] debug: _send_srun_resp_msg:
4/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.370] [11.0] debug: _send_srun_resp_msg:
5/5 failed to send msg type 6002: No route to host<br>
[2020-11-10T11:21:40.372] [11.0] debug: Message thread exited<br>
[2020-11-10T11:21:40.372] [11.0] debug: mpi/pmi2: agent thread
exit<br>
[2020-11-10T11:21:40.372] [11.0] <b>done with job</b></blockquote>
<p><br>
</p>
<p>But I do not understand what this "No route to host" means.</p>
<p><br>
</p>
<p>Thanks for your help.<br>
</p>
<p>Patrick<br>
</p>
<p><br>
</p>
</blockquote>
<p><br>
</p>
</div>
</blockquote></div>