<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
We have a user who wants to run multiple instances of a single-process
job across a cluster, using a loop like <br>
<pre>
for N in $nodelist; do
    srun -w $N program &amp;
done
wait
</pre>
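<p> For what it's worth, one variant we have been considering (but have
not yet tested at scale) is throttling the number of concurrent srun
launches rather than forking one per node all at once; the cap of 256
below is an arbitrary example value: <br>
</p>

```shell
#!/bin/bash
# Hypothetical throttled variant of the launch loop.
# MAXJOBS=256 is an arbitrary example cap, not a tested setting.
MAXJOBS=256
for N in $nodelist; do
    srun -w "$N" program &
    # Once the cap is reached, poll until at least one background
    # srun finishes before launching the next one.
    # (CentOS 7 ships bash 4.2, which lacks "wait -n", hence the poll.)
    while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
        sleep 1
    done
done
wait    # wait for the remaining jobs to finish
```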
<p> This works up to a thousand nodes or so (jobs are allocated by
node here), but as the number of jobs submitted increases, we
periodically see a variety of error messages, such as: <br>
</p>
<ul>
<li> srun: error: Ignoring job_complete for job 100035 because our
job ID is 102937 <br>
</li>
<li> srun: error: io_init_msg_read too small <br>
</li>
<li> srun: error: task 0 launch failed: Unspecified error <br>
</li>
<li> srun: error: Unable to allocate resources: Job/step already
completing or completed <br>
</li>
<li> srun: error: Unable to allocate resources: No error <br>
</li>
<li> srun: error: unpack error in io_init_msg_unpack <br>
</li>
<li> srun: Job step 211042.0 aborted before step completely
launched. <br>
</li>
</ul>
<p> We have tried setting <br>
</p>
<pre>
ulimit -n 500000
ulimit -u 64000
</pre>
but that wasn't sufficient. <br>
<p> The environment: <br>
</p>
<ul>
<li> CentOS 7.3 (x86_64) <br>
</li>
<li> Slurm 17.11.0 <br>
</li>
</ul>
<p> Does this ring any bells? Any thoughts about how we should
proceed?<br>
</p>
Andy
<pre class="moz-signature" cols="72">--
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</body>
</html>