<html>

  <head>


    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <p>All,</p>

    <p>I have several clusters that are all connected to a standalone

      slurmdbd server. They are not federated.</p>

    <p>I can check the queues and run most commands from that system

      using -M &lt;cluster&gt;. However, if I try to get a

      shell within a job (e.g., srun -M cluster-a --pty bash), the job queues

      up, but when it tries to run, I get an error:</p>

    <div class="moz-text-html" lang="x-western"><font size="2"

        face="Calibri"><span style="font-size:11pt;">

          <div>$ srun -M clustera -n16 --pty bash</div>

          <div>srun: job 6 has been allocated resources</div>

          <div>srun: error: Error connecting, bad data: family = 2, port

            = 0</div>

          <div>srun: error: Task launch for 6.0 failed on node

            ip-0A312014: Communication connection failure</div>

          <div>srun: error: Application launch failed: Communication

            connection failure</div>

          <div>srun: Job step aborted: Waiting up to 32 seconds for job

            step to finish.</div>

          <div>srun: error: Timed out waiting for job step to complete</div>

          <div><br>

          </div>


        </span></font>And in the slurmctld log for the cluster master:</div>

    <div class="moz-text-html" lang="x-western"><span

        style="font-size:11pt;">

        <div>Dec 17 19:50:04 nastran-master slurmctld[54739]: error:

          slurm_receive_msg [10.49.32.20:44022]: Zero Bytes were

          transmitted or received</div>

        <div>Dec 17 19:50:07 nastran-master slurmctld[54739]: error:

          slurm_receive_msg [10.49.32.20:44046]: Zero Bytes were

          transmitted or received</div>

        <div>Dec 17 19:50:08 nastran-master slurmctld[54739]:

          update_node: node ip-0A312014 state set to DOWN</div>

        <div>Dec 17 19:50:08 nastran-master slurmctld[54739]: Node

          ip-0A312014 now responding</div>

        <div>Dec 17 19:50:08 nastran-master slurmctld[54739]: node

          ip-0A312014 returned to service</div>

        <div>Dec 17 19:50:09 nastran-master slurmctld[54739]: sched:

          Allocate JobId=6 NodeList=ip-0A312014 #CPUs=16 Partition=debug</div>

        <div>Dec 17 19:50:09 nastran-master slurmctld[54739]:

          job_step_signal JobId=6 StepId=0 not found</div>

        <div>Dec 17 19:50:41 nastran-master slurmctld[54739]:

          job_step_signal JobId=6 StepId=0 not found</div>

        <div>Dec 17 19:50:41 nastran-master slurmctld[54739]:

          _job_complete: JobId=6 WTERMSIG 105</div>

        <div>Dec 17 19:50:41 nastran-master slurmctld[54739]:

          _job_complete: JobId=6 done</div>

      </span></div>

    <div class="moz-text-html" lang="x-western"><br>

    </div>

    <p>Is this something that cannot be done from a system outside the

      cluster unless the clusters are federated?</p>

    <p>Brian Andrus<br>

    </p>

  </body>

</html>