[slurm-users] Too many single-stream jobs?
Andy Riebs
andy.riebs at hpe.com
Mon Feb 12 14:42:27 MST 2018
We have a user who wants to run multiple instances of a single-process
job across a cluster, using a loop like this:
-----
for N in $nodelist; do
    srun -w "$N" program &
done
wait
-----
This works up to a thousand nodes or so (jobs are allocated by node
here), but as the number of submitted jobs grows, we periodically see a
variety of error messages, such as:
* srun: error: Ignoring job_complete for job 100035 because our job ID
is 102937
* srun: error: io_init_msg_read too small
* srun: error: task 0 launch failed: Unspecified error
* srun: error: Unable to allocate resources: Job/step already
completing or completed
* srun: error: Unable to allocate resources: No error
* srun: error: unpack error in io_init_msg_unpack
* srun: Job step 211042.0 aborted before step completely launched.
We have tried setting
ulimit -n 500000
ulimit -u 64000
but that wasn't sufficient.
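For reference, here is a minimal sketch of how the limits and the loop
fit together in the launch shell; $nodelist and "program" are
placeholders for the real node list and binary:
-----
#!/bin/bash
# Raise per-shell limits before spawning the backgrounded sruns
ulimit -n 500000    # open file descriptors
ulimit -u 64000     # user processes

# One backgrounded srun per node; each runs a single-process job
for N in $nodelist; do
    srun -w "$N" program &
done
wait    # return only after every srun has exited
-----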
The environment:
* CentOS 7.3 (x86_64)
* Slurm 17.11.0
Does this ring any bells? Any thoughts about how we should proceed?
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!