<div dir="ltr"><div>Hi Jean-Mathieu,</div><div><br></div><div>I'd also recommend that you update to 17.11.12. I had issues with job arrays in 17.11.7,</div><div>such as tasks erroneously being held as "DependencyNeverSatisfied" that, I'm <br></div><div>pleased to report, I have not seen in 17.11.12.</div><div><br></div><div>Best,</div><div>Lyn<br></div></div><br><div class="gmail_quote"><div dir="ltr">On Fri, Jan 11, 2019 at 8:13 AM Jean-mathieu CHANTREIN <<a href="mailto:jean-mathieu.chantrein@univ-angers.fr">jean-mathieu.chantrein@univ-angers.fr</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div style="font-family:arial,helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><div><em><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">Don't you put any limits on your master nodes?</span></span></em></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><br></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">To answer my own question:</span> <span title="">I only need to set the PropagateResourceLimits parameter in slurm.conf to NONE.</span> <span title="">This is not a problem, since I enable cgroups directly on each of the compute nodes.</span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title=""><br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">Regards.<br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title=""><br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation 
gmail-m_5130277119479515064translation"><span title="">Jean-Mathieu<br></span></span></div><div><br></div><hr id="gmail-m_5130277119479515064zwchr"><div><blockquote style="border-left:2px solid rgb(16,16,255);margin-left:5px;padding-left:5px;color:rgb(0,0,0);font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt"><b>From: </b>"Jean-Mathieu Chantrein" <<a href="mailto:jean-mathieu.chantrein@univ-angers.fr" target="_blank">jean-mathieu.chantrein@univ-angers.fr</a>><br><b>To: </b>"Slurm User Community List" <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br><b>Sent: </b>Friday, January 11, 2019 15:55:35<br><b>Subject: </b>Re: [slurm-users] Array job execution trouble: some jobs in the array fail<br></blockquote></div><div><blockquote style="border-left:2px solid rgb(16,16,255);margin-left:5px;padding-left:5px;color:rgb(0,0,0);font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt"><div style="font-family:arial,helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)"><div>Hello Jeffrey.<br></div><br><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">That's exactly it.</span> <span title="">Thank you very much; I would not have thought of that.</span> <span title="">I had actually set an nproc limit of 20 in /etc/security/limits.conf to prevent misuse by some users.</span> <span title="">I had not imagined for one second that it could propagate to the compute nodes!</span><br><br><span title="">Don't you put any limits on your master nodes?</span><br><br><span title="">In any case, your help has been particularly useful to me.</span> <span title="">Thanks a lot again.</span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span 
title=""><br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">Best regards.<br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title=""><br></span></span></div><div><span class="gmail-m_5130277119479515064tlid-translation gmail-m_5130277119479515064translation"><span title="">Jean-Mathieu<br></span></span></div><br><br><hr id="gmail-m_5130277119479515064zwchr"><div><blockquote style="border-left:2px solid rgb(16,16,255);margin-left:5px;padding-left:5px;color:rgb(0,0,0);font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt"><b>From: </b>"Jeffrey Frey" <<a href="mailto:frey@udel.edu" target="_blank">frey@udel.edu</a>><br><b>To: </b>"Slurm User Community List" <<a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a>><br><b>Sent: </b>Friday, January 11, 2019 15:27:13<br><b>Subject: </b>Re: [slurm-users] Array job execution trouble: some jobs in the array fail<br></blockquote></div><div><blockquote style="border-left:2px solid rgb(16,16,255);margin-left:5px;padding-left:5px;color:rgb(0,0,0);font-weight:normal;font-style:normal;text-decoration:none;font-family:Helvetica,Arial,sans-serif;font-size:12pt">What does ulimit tell you on the compute node(s) where the jobs are running?  The error message you cited arises when a user has reached the per-user process count limit (e.g. "ulimit -u").  If your Slurm config doesn't limit how many jobs a node can execute concurrently (e.g. 
oversubscribe), then:<div><br></div><div><br></div><blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px"><div>- no matter what, you have a race condition here (when/if the process limit is reached)</div><div><br></div><div>- the behavior is skewed toward happening more quickly/easily when your job actually lasts a non-trivial amount of time (e.g. by adding the usleep()).</div></blockquote><div><div><br></div><div><br></div><div>It's likely you have stringent limits on your head/login node that are getting propagated to the compute environment (see PropagateResourceLimits in the slurm.conf documentation).  By default Slurm propagates all ulimits that are set in your submission shell.</div><div><br></div><div><br></div><div>E.g.</div><div><div><br></div></div><blockquote style="margin:0px 0px 0px 40px;border:medium none;padding:0px"><div><div>[frey@login00 ~]$ srun ... --propagate=NONE /bin/bash</div></div><div><div>  [frey@login00 ~]$ hostname</div></div><div><div>  <a href="http://r00n56.localdomain.hpc.udel.edu" target="_blank">r00n56.localdomain.hpc.udel.edu</a><br></div></div><div><div>  [frey@login00 ~]$ ulimit -u</div></div><div><div>  4096</div></div>  [frey@login00 ~]$ exit<div>   :</div><div><div>[frey@login00 ~]$ ulimit -u 24</div></div><div><div>[frey@login00 ~]$ srun ... --propagate=ALL /bin/bash</div></div><div><div>  [frey@login00 ~]$ hostname</div></div><div><div>  <a href="http://r00n49.localdomain.hpc.udel.edu" target="_blank">r00n49.localdomain.hpc.udel.edu</a><br></div></div><div><div>  [frey@login00 ~]$ ulimit -u</div></div><div><div>  24</div></div><div><div>  [frey@login00 ~]$ exit</div></div></blockquote><div><div><br></div><div><br></div><div><blockquote>On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <<a href="mailto:jean-mathieu.chantrein@univ-angers.fr" target="_blank">jean-mathieu.chantrein@univ-angers.fr</a>> wrote:<br><br>Hello,<br><br>I'm new to Slurm (I used SGE before) and I'm new to this list. 
I am having some difficulties with Slurm's job arrays; maybe you can help me?<br><br>I am running Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.<br><br>For my current user, I have the following limits:<br>Fairshare = 99<br>MaxJobs = 50<br>MaxSubmitJobs = 100<br><br>I wrote a small C++ hello_world program to run some tests, and a 100-task hello_world array job works properly.<br>If I take the same program but add a usleep of 10 seconds (to observe the behavior with squeue and to simulate a slightly longer-running program), some of the array tasks fail (FAILED) with an exit code of 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the Slurm log). The proportion of failed tasks varies from one run to the next. Here is the error output of one of these tasks:<br><br>$ cat ERR/11617-9 <br>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br>/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable<br><br>Note that I have enough resources to run more than 50 jobs at the same time ...<br><br>If I rerun my submission script while forcing Slurm to execute only 10 tasks at a time (--array=1-100%10), all tasks succeed. But if I allow 30 tasks at a time (--array=1-100%30), some of them fail again.<br><br>Has anyone ever faced this type of problem? 
If so, please kindly enlighten me.<br><br>Regards<br><br>Jean-Mathieu Chantrein<br>In charge of the LERIA computing center<br>University of Angers<br><br>__________________<br>hello_array.slurm<br><br>#!/bin/bash<br># hello_array.slurm<br>#SBATCH --job-name=hello<br>#SBATCH --output=OUT/%A-%a<br>#SBATCH --error=ERR/%A-%a<br>#SBATCH --partition=std<br>#SBATCH --array=1-100%10<br>./hello "$SLURM_ARRAY_TASK_ID"<br><br>________________<br>main.cpp<br><br>#include <iostream><br>#include <unistd.h><br><br>int main(int argc, char** argv) {<br>    // Sleep 10 s so the task stays visible in squeue<br>    usleep(10000000);<br>    // Fall back to "?" when no task ID is passed on the command line<br>    std::cout << "Hello world! job array number " << (argc > 1 ? argv[1] : "?") << std::endl;<br>    return 0;<br>}<br><br><br></blockquote><br><div><br>::::::::::::::::::::::::::::::::::::::::::::::::::::::<br>Jeffrey T. Frey, Ph.D.<br>Systems Programmer V / HPC Management<br>Network & Systems Services / College of Engineering<br>University of Delaware, Newark DE  19716<br>Office: (302) 831-6034  Mobile: (302) 419-4976<br>::::::::::::::::::::::::::::::::::::::::::::::::::::::<br><br><br><br></div><br></div></div></div></blockquote></div></div><br></blockquote></div></div></div></blockquote></div>
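<div><br></div><div>For reference, a minimal sketch of the slurm.conf change discussed in this thread. PropagateResourceLimits=NONE is the setting Jean-Mathieu reports using. The commented-out PropagateResourceLimitsExcept line is an alternative that nobody in the thread mentions; it is shown only as an assumption about what a site might prefer if it wants to keep propagating every limit except the per-user process count. The rest of the file is site-specific and omitted.</div><pre>
# slurm.conf (fragment)

# Do not copy any ulimit from the submission shell to the job, so a
# small nproc limit on the login node no longer applies on compute nodes.
PropagateResourceLimits=NONE

# Assumed alternative, not from the thread: propagate everything except
# the per-user process limit ("ulimit -u").
#PropagateResourceLimitsExcept=NPROC
</pre><div>A single submission can be checked without editing slurm.conf by passing --propagate=NONE to srun, as in Jeffrey's example, and then running "ulimit -u" inside the job.</div>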