<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body bgcolor="#FFFFFF" text="#000000">
We have a user who wants to run multiple instances of a single-process
job across a cluster, using a loop like <br>
<pre>
for N in $nodelist; do
    srun -w $N program &amp;
done
wait
</pre>
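<p> For what it's worth, one variant we have been considering (but have
not yet tested at scale) is throttling the number of concurrent srun
launches rather than forking one per node all at once; the cap of 256
below is an arbitrary example value: <br>
</p>

```shell
#!/bin/bash
# Hypothetical throttled variant of the launch loop.
# MAXJOBS=256 is an arbitrary example cap, not a tested setting.
MAXJOBS=256
for N in $nodelist; do
    srun -w "$N" program &
    # Once the cap is reached, poll until at least one background
    # srun finishes before launching the next one.
    # (CentOS 7 ships bash 4.2, which lacks "wait -n", hence the poll.)
    while [ "$(jobs -rp | wc -l)" -ge "$MAXJOBS" ]; do
        sleep 1
    done
done
wait    # wait for the remaining jobs to finish
```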
<p> This works up to a thousand nodes or so (jobs are allocated by
node here), but as the number of jobs submitted increases, we
periodically see a variety of error messages, such as: <br>
</p>
<ul>
<li> srun: error: Ignoring job_complete for job 100035 because our
job ID is 102937 <br>
</li>
<li> srun: error: io_init_msg_read too small <br>
</li>
<li> srun: error: task 0 launch failed: Unspecified error <br>
</li>
<li> srun: error: Unable to allocate resources: Job/step already
completing or completed <br>
</li>
<li> srun: error: Unable to allocate resources: No error <br>
</li>
<li> srun: error: unpack error in io_init_msg_unpack <br>
</li>
<li> srun: Job step 211042.0 aborted before step completely
launched. <br>
</li>
</ul>
<p> We have tried setting <br>
</p>
<pre>
ulimit -n 500000
ulimit -u 64000
</pre>
but that wasn't sufficient. <br>
<p> The environment: <br>
</p>
<ul>
<li> CentOS 7.3 (x86_64) <br>
</li>
<li> Slurm 17.11.0 <br>
</li>
</ul>
<p> Does this ring any bells? Any thoughts about how we should
proceed?<br>
</p>
Andy
<pre class="moz-signature" cols="72">--
Andy Riebs
<a class="moz-txt-link-abbreviated" href="mailto:andy.riebs@hpe.com">andy.riebs@hpe.com</a>
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!
</pre>
</body>
</html>