<html><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">What does ulimit report on the compute node(s) where the jobs are running? The "fork: Resource temporarily unavailable" error you cited arises when a user has reached the per-user process count limit (see "ulimit -u"). If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. oversubscribe), then:<div class=""><br class=""></div><div class=""><br class=""></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;" class=""><div class="">- no matter what, you have a race condition here (it shows up when/if the process limit is reached)</div><div class=""><br class=""></div><div class="">- the failure becomes more likely when each job actually lasts a non-trivial amount of time (e.g. after adding the usleep()), because more of your processes are alive on the node at the same time.</div></blockquote><div class=""><div class=""><br class=""></div><div class=""><br class=""></div><div class="">It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation). By default, Slurm propagates all of the resource limits (ulimits) in effect in your submission shell.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">E.g.</div><div class=""><div class=""><br class=""></div></div><blockquote style="margin: 0 0 0 40px; border: none; padding: 0px;" class=""><div class=""><div class="">[frey@login00 ~]$ srun ... 
--propagate=NONE /bin/bash</div></div><div class=""><div class=""> [frey@login00 ~]$ hostname</div></div><div class=""><div class=""> <a href="http://r00n56.localdomain.hpc.udel.edu" class="">r00n56.localdomain.hpc.udel.edu</a></div></div><div class=""><div class=""> [frey@login00 ~]$ ulimit -u</div></div><div class=""><div class=""> 4096</div></div> [frey@login00 ~]$ exit<div class=""> :</div><div class=""><div class="">[frey@login00 ~]$ ulimit -u 24</div></div><div class=""><div class="">[frey@login00 ~]$ srun ... --propagate=ALL /bin/bash</div></div><div class=""><div class=""> [frey@login00 ~]$ hostname</div></div><div class=""><div class=""> <a href="http://r00n49.localdomain.hpc.udel.edu" class="">r00n49.localdomain.hpc.udel.edu</a></div></div><div class=""><div class=""> [frey@login00 ~]$ ulimit -u</div></div><div class=""><div class=""> 24</div></div><div class=""><div class=""> [frey@login00 ~]$ exit</div></div></blockquote><div class=""><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><blockquote type="cite" class="">On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <<a href="mailto:jean-mathieu.chantrein@univ-angers.fr" class="">jean-mathieu.chantrein@univ-angers.fr</a>> wrote:<br class=""><br class="">Hello,<br class=""><br class="">I'm new to slurm (I used SGE before) and I'm new to this list. I have some difficulties with the use of slurm's array jobs, maybe you can help me?<br class=""><br class="">I am working with slurm version 17.11.7 on a debian testing. 
I use slurmdbd and fairshare.<br class=""><br class="">For my current user, I have the following limitations:<br class="">Fairshare = 99<br class="">MaxJobs = 50<br class="">MaxSubmitJobs = 100<br class=""><br class="">I did a little C++ program hello_world to do some tests and a 100 job hello_world array job is working properly.<br class="">If I take the same program but I add a usleep of 10 seconds (to see the behavior with squeue and simulate a program a little longer), I have a part of my job that fails (FAILED) with a error 126:0 (output of sacct -j) and WEXITSTATUS 254 (in slurm log). The proportion of the error number of these jobs is variable between different executions. Here is the error output of one of these jobs:<br class=""><br class="">$ cat ERR/11617-9 <br class="">/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br class="">/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br class="">/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br class="">/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br class="">/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable<br class=""><br class="">Note I have enough resources to run more than 50 jobs at the same time ...<br class=""><br class="">If I restart my submission script by forcing slurm to execute only 10 jobs at the same time (--array=1-100%10), all jobs succeed. But if I force slurm to execute only 30 jobs at the same time (--array=1-100%30), I have a part that fails again.<br class=""><br class="">Has anyone ever faced this type of problem? 
If so, please kindly enlighten me.<br class=""><br class="">Regards<br class=""><br class="">Jean-Mathieu Chantrein<br class="">In charge of the LERIA computing center<br class="">University of Angers<br class=""><br class="">__________________<br class="">hello_array.slurm<br class=""><br class="">#!/bin/bash<br class=""># hello.slurm<br class="">#SBATCH --job-name=hello<br class="">#SBATCH --output=OUT/%A-%a<br class="">#SBATCH --error=ERR/%A-%a<br class="">#SBATCH --partition=std<br class="">#SBATCH --array=1-100%10<br class="">./hello $SLURM_ARRAY_TASK_ID<br class=""><br class="">________________<br class="">main.cpp<br class=""><br class="">#include <iostream><br class="">#include <unistd.h><br class=""><br class="">int main(int arg, char** argv) {<br class=""> usleep(10000000);<br class=""> std::cout<<"Hello world! job array number "<<argv[1]<<std::endl;<br class=""> return 0;<br class="">}<br class=""><br class=""><br class=""></blockquote><br class=""><div class=""><br class="">::::::::::::::::::::::::::::::::::::::::::::::::::::::<br class="">Jeffrey T. Frey, Ph.D.<br class="">Systems Programmer V / HPC Management<br class="">Network & Systems Services / College of Engineering<br class="">University of Delaware, Newark DE 19716<br class="">Office: (302) 831-6034 Mobile: (302) 419-4976<br class="">::::::::::::::::::::::::::::::::::::::::::::::::::::::<br class=""><br class=""><br class=""><br class=""></div><br class=""></div></div></div></body></html>