[slurm-users] Too many single-stream jobs?

Andy Riebs andy.riebs at hpe.com
Mon Feb 12 17:01:49 MST 2018


Many thanks Matthieu!

Andy


On 02/12/2018 06:42 PM, Matthieu Hautreux wrote:
> Hi,
>
> Your login node may be under heavy load while starting such a large 
> number of independent sruns.
>
> This can surface issues that are not seen under normal load, such as 
> partial reads/writes on sockets, which trigger bugs in Slurm functions 
> that are not properly protected against such events.
>
> Quickly looking at the source code around the function that generates 
> the "io_init_msg_read too small" message, it seems that at least its 
> write-side counterpart, shown below, is not properly protected against 
> partial writes:
>
> int
> io_init_msg_write_to_fd(int fd, struct slurm_io_init_msg *msg)
> {
>         Buf buf;
>         void *ptr;
>         int n;
>
>         xassert(msg);
>
>         debug2("Entering io_init_msg_write_to_fd");
>         msg->version = IO_PROTOCOL_VERSION;
>         buf = init_buf(io_init_msg_packed_size());
>         debug2("  msg->nodeid = %d", msg->nodeid);
>         io_init_msg_pack(msg, buf);
>
>         ptr = get_buf_data(buf);
> again:
>         /* Only EINTR is retried here; write() can legitimately return
>          * a short count under heavy load. */
>         if ((n = write(fd, ptr, io_init_msg_packed_size())) < 0) {
>                 if (errno == EINTR)
>                         goto again;
>                 free_buf(buf);
>                 return SLURM_ERROR;
>         }
>         /* A partial write lands here and is treated as a hard error. */
>         if (n != io_init_msg_packed_size()) {
>                 error("io init msg write too small");
>                 free_buf(buf);
>                 return SLURM_ERROR;
>         }
>
>         free_buf(buf);
>         debug2("Leaving io_init_msg_write_to_fd");
>         return SLURM_SUCCESS;
> }
>
> A proper way to handle partial writes looks like the following (taken 
> from elsewhere in the Slurm codebase):
>
> ssize_t fd_write_n(int fd, void *buf, size_t n)
> {
>         size_t nleft;
>         ssize_t nwritten;
>         unsigned char *p;
>
>         p = buf;
>         nleft = n;
>         while (nleft > 0) {
>                 /* Retry on EINTR and keep looping until all n bytes
>                  * have been written. */
>                 if ((nwritten = write(fd, p, nleft)) < 0) {
>                         if (errno == EINTR)
>                                 continue;
>                         else
>                                 return(-1);
>                 }
>                 nleft -= nwritten;
>                 p += nwritten;
>         }
>         return(n);
> }
>
>
> It seems that some code cleanup/refactoring could be done in Slurm to 
> limit the risk of this kind of issue, for instance by reusing fd_write_n 
> in io_init_msg_write_to_fd, as sketched below. I am not sure that this 
> would resolve your problem, but it certainly seems harmful to leave that 
> in the code.
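>
> For illustration only (a rough sketch, not actual Slurm code), the write 
> path above could be reworked to delegate the write loop to fd_write_n, 
> so that a short write() is completed instead of reported as an error; 
> the error message text here is made up:
>
> /* Sketch: io_init_msg_write_to_fd delegating the actual write loop to
>  * fd_write_n(), which already handles EINTR and partial writes. */
> int
> io_init_msg_write_to_fd(int fd, struct slurm_io_init_msg *msg)
> {
>         Buf buf;
>         int rc = SLURM_SUCCESS;
>
>         xassert(msg);
>
>         msg->version = IO_PROTOCOL_VERSION;
>         buf = init_buf(io_init_msg_packed_size());
>         io_init_msg_pack(msg, buf);
>
>         /* fd_write_n() loops until the whole message has been sent. */
>         if (fd_write_n(fd, get_buf_data(buf),
>                        io_init_msg_packed_size()) < 0) {
>                 error("io_init_msg_write_to_fd: write failed");
>                 rc = SLURM_ERROR;
>         }
>
>         free_buf(buf);
>         return rc;
> }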
>
> You should file a bug for that.
>
> HTH
> Matthieu
>
>
> 2018-02-12 22:42 GMT+01:00 Andy Riebs <andy.riebs at hpe.com>:
>
>     We have a user who wants to run multiple instances of a
>     single-process job across a cluster, using a loop like
>
>     -----
>     for N in $nodelist; do
>          srun -w $N program &
>     done
>     wait
>     -----
>
>     This works up to a thousand nodes or so (jobs are allocated by
>     node here), but as the number of jobs submitted increases, we
>     periodically see a variety of error messages, such as
>
>       * srun: error: Ignoring job_complete for job 100035 because our
>         job ID is 102937
>       * srun: error: io_init_msg_read too small
>       * srun: error: task 0 launch failed: Unspecified error
>       * srun: error: Unable to allocate resources: Job/step already
>         completing or completed
>       * srun: error: Unable to allocate resources: No error
>       * srun: error: unpack error in io_init_msg_unpack
>       * srun: Job step 211042.0 aborted before step completely launched.
>
>     We have tried setting
>
>         ulimit -n 500000
>         ulimit -u 64000
>
>     but that wasn't sufficient.
>
>     The environment:
>
>       * CentOS 7.3 (x86_64)
>       * Slurm 17.11.0
>
>     Does this ring any bells? Any thoughts about how we should proceed?
>
>     Andy
>
>     -- 
>     Andy Riebs
>     andy.riebs at hpe.com
>     Hewlett-Packard Enterprise
>     High Performance Computing Software Engineering
>     +1 404 648 9024
>     My opinions are not necessarily those of HPE
>          May the source be with you!
>
>

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
