[slurm-users] Too many single-stream jobs?

Andy Riebs andy.riebs at hpe.com
Mon Feb 12 14:42:27 MST 2018


We have a user who wants to run multiple instances of a single-process 
job across a cluster, using a loop like

-----
for N in $nodelist; do
      srun -w $N program &
done
wait
-----
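
In case it helps, here is a slightly fuller sketch of what that loop 
looks like in practice. (The sinfo-based node list and "program" are 
placeholders here; the real script differs in the details.)

-----
#!/bin/bash
# Sketch only: build a list of candidate nodes, one hostname per line.
nodelist=$(sinfo -h -N -o "%N" | sort -u)

for N in $nodelist; do
      # One single-task job per node, backgrounded so they run concurrently.
      srun -N1 -n1 -w "$N" program &
done
wait    # block until every backgrounded srun has exited
-----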

This works up to a thousand nodes or so (jobs are allocated by node 
here), but as the number of jobs submitted increases, we periodically 
see a variety of error messages, such as

  * srun: error: Ignoring job_complete for job 100035 because our job ID
    is 102937
  * srun: error: io_init_msg_read too small
  * srun: error: task 0 launch failed: Unspecified error
  * srun: error: Unable to allocate resources: Job/step already
    completing or completed
  * srun: error: Unable to allocate resources: No error
  * srun: error: unpack error in io_init_msg_unpack
  * srun: Job step 211042.0 aborted before step completely launched.

We have tried setting

    ulimit -n 500000
    ulimit -u 64000

but that wasn't sufficient.
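
(Those limits are raised in the shell that runs the submission loop, 
roughly as sketched below; they apply to that shell and the srun client 
processes it spawns, not to the daemons on the compute nodes.)

-----
# Set in the submitting shell before the loop starts.
ulimit -n 500000    # maximum open file descriptors
ulimit -u 64000     # maximum user processes/threads
-----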

The environment:

  * CentOS 7.3 (x86_64)
  * Slurm 17.11.0

Does this ring any bells? Any thoughts about how we should proceed?

Andy

-- 
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
     May the source be with you!
