<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>Many thanks Matthieu!</p>
    <p>Andy<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 02/12/2018 06:42 PM, Matthieu
      Hautreux wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAChPGiBDo3i8Fz6PEqb2q9amd1m-A8Rw-65_=Vx4Wyd+-cvo+Q@mail.gmail.com">
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>
                <div>
                  <div>
                    <div>Hi,<br>
                      <br>
                    </div>
                    your login node may have a heavy load while starting
                    such a large number of independent sruns.<br>
                    <br>
                    This may induce issues not seen under normal load,
                    such as partial reads/writes on sockets, which can
                    trigger bugs in Slurm functions that are not properly
                    protected against such events.<br>
                    <br>
                  </div>
                  Quickly looking at the source code around the function
                  that generates the "io_init_msg_read too small" message,
                  it seems that at least its write-side counterpart,
                  io_init_msg_write_to_fd(), is not properly protected
                  against partial writes:<br>
                  <br>
                  <pre>
int
io_init_msg_write_to_fd(int fd, struct slurm_io_init_msg *msg)
{
        Buf buf;
        void *ptr;
        int n;

        xassert(msg);

        debug2("Entering io_init_msg_write_to_fd");
        msg-&gt;version = IO_PROTOCOL_VERSION;
        buf = init_buf(io_init_msg_packed_size());
        debug2("  msg-&gt;nodeid = %d", msg-&gt;nodeid);
        io_init_msg_pack(msg, buf);

        ptr = get_buf_data(buf);
again:
        /* a partial write is reported as an error below instead of
         * being retried */
        if ((n = write(fd, ptr, io_init_msg_packed_size())) &lt; 0) {
                if (errno == EINTR)
                        goto again;
                free_buf(buf);
                return SLURM_ERROR;
        }
        if (n != io_init_msg_packed_size()) {
                error("io init msg write too small");
                free_buf(buf);
                return SLURM_ERROR;
        }

        free_buf(buf);
        debug2("Leaving io_init_msg_write_to_fd");
        return SLURM_SUCCESS;
}
</pre>
                  <br>
                </div>
                A proper way to handle a partial write is the following
                (taken from elsewhere in the Slurm codebase):<br>
                <br>
                <pre>
ssize_t fd_write_n(int fd, void *buf, size_t n)
{
        size_t nleft;
        ssize_t nwritten;
        unsigned char *p;

        p = buf;
        nleft = n;
        while (nleft &gt; 0) {
                if ((nwritten = write(fd, p, nleft)) &lt; 0) {
                        if (errno == EINTR)
                                continue;
                        else
                                return(-1);
                }
                nleft -= nwritten;
                p += nwritten;
        }
        return(n);
}
</pre>
                <br>
                <br>
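                If the read path that produces the "io_init_msg_read too
                small" error follows the same pattern, it would need the
                same treatment. A minimal sketch of such a retry loop
                (the helper name and return convention here are only
                illustrative, not the actual Slurm code):<br>
                <br>
                <pre>
#include &lt;errno.h&gt;
#include &lt;unistd.h&gt;

/* Illustrative sketch only: read up to n bytes, retrying on EINTR
 * and on partial reads. Returns the number of bytes read (n unless
 * EOF was reached earlier), or -1 on error. */
ssize_t fd_read_n(int fd, void *buf, size_t n)
{
        size_t nleft = n;
        unsigned char *p = buf;

        while (nleft &gt; 0) {
                ssize_t nread = read(fd, p, nleft);
                if (nread &lt; 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, retry */
                        return -1;              /* real error */
                }
                if (nread == 0)
                        break;                  /* EOF before n bytes */
                nleft -= nread;
                p += nread;
        }
        return (ssize_t)(n - nleft);
}
</pre>
                <br>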
              </div>
              It seems that some code cleanup/refactoring could be
              performed in Slurm to limit the risk of this kind of issue.
              I am not sure that it would resolve your problem, but it
              seems harmful to leave code like that in place.<br>
              <br>
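              For example, io_init_msg_write_to_fd() could simply
              delegate the write to fd_write_n(). A rough, untested
              sketch of that factoring (the error message text and
              details are only illustrative):<br>
              <br>
              <pre>
/* Rough sketch of the suggested factoring -- untested. */
int
io_init_msg_write_to_fd(int fd, struct slurm_io_init_msg *msg)
{
        Buf buf;
        int rc = SLURM_SUCCESS;

        xassert(msg);

        msg-&gt;version = IO_PROTOCOL_VERSION;
        buf = init_buf(io_init_msg_packed_size());
        io_init_msg_pack(msg, buf);

        /* fd_write_n() keeps writing until the whole message is out,
         * retrying on EINTR and on partial writes. */
        if (fd_write_n(fd, get_buf_data(buf),
                       io_init_msg_packed_size()) &lt; 0) {
                error("io_init_msg_write_to_fd: write failed");
                rc = SLURM_ERROR;
        }

        free_buf(buf);
        return rc;
}
</pre>
              <br>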
            </div>
            You should file a bug for that.<br>
            <br>
          </div>
          HTH<br>
        </div>
        Matthieu<br>
        <br>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">2018-02-12 22:42 GMT+01:00 Andy Riebs <span
            dir="ltr"><<a href="mailto:andy.riebs@hpe.com"
              target="_blank" moz-do-not-send="true">andy.riebs@hpe.com</a>></span>:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div bgcolor="#FFFFFF" text="#000000"> We have a user who
              wants to run multiple instances of a single-process job
              across a cluster, using a loop like <br>
              <pre>
for N in $nodelist; do
    srun -w $N program &
done
wait
</pre>
              <p> This works up to a thousand nodes or so (jobs are
                allocated by node here), but as the number of jobs
                submitted increases, we periodically see a variety of
                different error messages, such as <br>
              </p>
              <ul>
                <li> srun: error: Ignoring job_complete for job 100035
                  because our job ID is 102937 <br>
                </li>
                <li> srun: error: io_init_msg_read too small <br>
                </li>
                <li> srun: error: task 0 launch failed: Unspecified
                  error <br>
                </li>
                <li> srun: error: Unable to allocate resources: Job/step
                  already completing or completed <br>
                </li>
                <li> srun: error: Unable to allocate resources: No error
                  <br>
                </li>
                <li> srun: error: unpack error in io_init_msg_unpack <br>
                </li>
                <li> srun: Job step 211042.0 aborted before step
                  completely launched. <br>
                </li>
              </ul>
              <p> We have tried setting <br>
              </p>
              <pre>
ulimit -n 500000
ulimit -u 64000
</pre>
              but that wasn't sufficient. <br>
              <p> The environment: <br>
              </p>
              <ul>
                <li> CentOS 7.3 (x86_64) <br>
                </li>
                <li> Slurm 17.11.0 <br>
                </li>
              </ul>
              <p> Does this ring any bells? Any thoughts about how we
                should proceed?<span class="HOEnZb"><font
                    color="#888888"><br>
                  </font></span></p>
              <span class="HOEnZb"><font color="#888888"> Andy
                  <pre class="m_8358795028441754057moz-signature" cols="72">-- 
Andy Riebs
<a class="m_8358795028441754057moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com" target="_blank" moz-do-not-send="true">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
<a href="tel:%28404%29%20648-9024" value="+14046489024" target="_blank" moz-do-not-send="true">+1 404 648 9024</a>
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>
                </font></span></div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
    <pre class="moz-signature" cols="72">-- 
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
    May the source be with you!
</pre>
  </body>
</html>