<html><body><div style="font-family: arial, helvetica, sans-serif; font-size: 12pt; color: #000000"><div>Hello,<br><br>I'm new to Slurm (I used SGE before) and new to this list. I am having trouble with Slurm job arrays; maybe you can help me?<br><br>I am running Slurm version 17.11.7 on Debian testing, with slurmdbd and fairshare.<br><br>My current user has the following limits:<br>Fairshare = 99<br>MaxJobs = 50<br>MaxSubmitJobs = 100<br><br>I wrote a small C++ hello_world program for testing, and a 100-task hello_world job array runs correctly. </div><div>If I take the same program and add a 10-second usleep (to watch the behavior in squeue and to simulate a slightly longer program), some of the array tasks fail (state FAILED) with exit code 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the Slurm log). The proportion of failed tasks varies from one run to the next. Here is the error output of one of these tasks:<br></div><div><br></div><div>$ cat ERR/11617-9 <br>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable</div><div>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable</div><div>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable</div><div>/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable<br>/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable<br><br>Note that I have enough resources to run more than 50 jobs at the same time.<br><br>If I rerun my submission script while limiting Slurm to 10 simultaneous tasks (--array=1-100%10), all tasks succeed. But if I limit it to 30 simultaneous tasks (--array=1-100%30), some of them fail again.<br><br>Has anyone ever faced this type of problem? 
If so, I would appreciate any pointers.<br><br>Regards<br></div><div><br></div><div>Jean-Mathieu Chantrein<br></div><div>In charge of the LERIA computing center</div><div>University of Angers</div><div><br></div><div><div>__________________<br></div><div><div>hello_array.slurm</div><div><br></div></div><div>#!/bin/bash<br># hello_array.slurm<br>#SBATCH --job-name=hello<br>#SBATCH --output=OUT/%A-%a<br>#SBATCH --error=ERR/%A-%a<br>#SBATCH --partition=std<br>#SBATCH --array=1-100%10<br>./hello $SLURM_ARRAY_TASK_ID<br></div><div><br></div><div>________________<br></div><div><div>main.cpp</div></div><div><br></div><div>#include &lt;iostream&gt;<br>#include &lt;unistd.h&gt;<br><br>int main(int argc, char** argv) {<br>&nbsp;&nbsp;&nbsp;&nbsp;// Sleep for 10 seconds (argument is in microseconds)<br>&nbsp;&nbsp;&nbsp;&nbsp;usleep(10000000);<br>&nbsp;&nbsp;&nbsp;&nbsp;// argv[1] is the array task ID passed by the batch script<br>&nbsp;&nbsp;&nbsp;&nbsp;std::cout &lt;&lt; "Hello world! job array number " &lt;&lt; argv[1] &lt;&lt; std::endl;<br>&nbsp;&nbsp;&nbsp;&nbsp;return 0;<br>}</div></div></div></body></html>