[slurm-users] Array job execution trouble: some jobs in the array fail
Lyn Gerner
schedulerqueen at gmail.com
Fri Jan 11 17:07:25 UTC 2019
Hi Jean-Mathieu,
I'd also recommend that you update to 17.11.12. I had issues with job arrays
in 17.11.7, such as tasks erroneously being held as "DependencyNeverSatisfied"
that, I'm pleased to report, I have not seen in .12.
Best,
Lyn
On Fri, Jan 11, 2019 at 8:13 AM Jean-mathieu CHANTREIN <
jean-mathieu.chantrein at univ-angers.fr> wrote:
> *You don't put any limitation on your master nodes?*
>
> I'll answer my own question: I only had to set the PropagateResourceLimits
> parameter in slurm.conf to NONE. This is not a problem, since I enable
> cgroups directly on each of the compute nodes.
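>
> For reference, the change amounts to a single line in slurm.conf (a minimal
> sketch; everything else in my slurm.conf is unchanged and omitted here):
>
> # slurm.conf
> # do not copy the submission shell's resource limits (ulimits) to jobs
> PropagateResourceLimits=NONE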
>
> Regards.
>
> Jean-Mathieu
>
> ------------------------------
>
> *De: *"Jean-Mathieu Chantrein" <jean-mathieu.chantrein at univ-angers.fr>
> *À: *"Slurm User Community List" <slurm-users at lists.schedmd.com>
> *Envoyé: *Vendredi 11 Janvier 2019 15:55:35
> *Objet: *Re: [slurm-users] Array job execution trouble: some jobs
> in the array fail
>
> Hello Jeffrey.
>
> That's exactly it. Thank you very much, I would not have thought of
> that. I had in fact set a limit of 20 processes (nproc) in
> /etc/security/limits.conf to avoid potential misuse by some users. I had
> not imagined for one second that it could propagate to the compute nodes!
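>
> For reference, the entry I mean has this form (a sketch only; the "*"
> domain shown here is illustrative rather than my exact entry):
>
> # /etc/security/limits.conf on the login/master node
> # cap each user's process count at 20
> *    hard    nproc    20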
>
> You don't put any limitation on your master nodes?
>
> In any case, your help is particularly useful to me. Thanks a lot again.
>
> Best regards.
>
> Jean-Mathieu
>
>
> ------------------------------
>
> *De: *"Jeffrey Frey" <frey at udel.edu>
> *À: *"Slurm User Community List" <slurm-users at lists.schedmd.com>
> *Envoyé: *Vendredi 11 Janvier 2019 15:27:13
> *Objet: *Re: [slurm-users] Array job execution trouble: some jobs in
> the array fail
>
> What does ulimit tell you on the compute node(s) where the jobs are
> running? The error message you cited arises when a user has reached the
> per-user process count limit (e.g. "ulimit -u"). If your Slurm config
> doesn't limit how many jobs a node can execute concurrently (e.g.
> oversubscribe), then:
>
>
> - no matter what, you have a race condition here (when/if the process limit
> is reached)
>
> - the behavior is skewed toward happening more quickly/easily when your
> job actually lasts a non-trivial amount of time (e.g. by adding the
> usleep()).
>
>
>
> It's likely you have stringent limits on your head/login node that are
> getting propagated to the compute environment (see PropagateResourceLimits
> in the slurm.conf documentation). By default, Slurm propagates all ulimits
> set in your submission shell.
>
>
> E.g.
>
> [frey at login00 ~]$ srun ... --propagate=NONE /bin/bash
> [frey at login00 ~]$ hostname
> r00n56.localdomain.hpc.udel.edu
> [frey at login00 ~]$ ulimit -u
> 4096
> [frey at login00 ~]$ exit
> :
> [frey at login00 ~]$ ulimit -u 24
> [frey at login00 ~]$ srun ... --propagate=ALL /bin/bash
> [frey at login00 ~]$ hostname
> r00n49.localdomain.hpc.udel.edu
> [frey at login00 ~]$ ulimit -u
> 24
> [frey at login00 ~]$ exit
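>
> You can also check the propagated limit from inside a batch job (a quick
> sketch, assuming the "std" partition from your script below; the output
> lands in the default slurm-<jobid>.out files):
>
> $ sbatch --partition=std --propagate=NONE --wrap='ulimit -u'
> $ sbatch --partition=std --propagate=ALL --wrap='ulimit -u'
>
> and compare the two values.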
>
>
>
> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN <
> jean-mathieu.chantrein at univ-angers.fr> wrote:
>
> Hello,
>
> I'm new to Slurm (I used SGE before) and I'm new to this list. I'm having
> some difficulties with Slurm's array jobs; maybe you can help me?
>
> I am working with Slurm version 17.11.7 on Debian testing. I use
> slurmdbd and fairshare.
>
> For my current user, I have the following limitations:
> Fairshare = 99
> MaxJobs = 50
> MaxSubmitJobs = 100
>
> I wrote a little hello_world C++ program to do some tests, and a 100-task
> hello_world array job works properly.
> If I take the same program but add a usleep of 10 seconds (to observe the
> behavior with squeue and to simulate a slightly longer program), some of
> the jobs fail (FAILED) with error 126:0 (output of sacct -j) and
> WEXITSTATUS 254 (in the slurm log). The proportion of jobs that fail varies
> between runs. Here is the error output of one of these jobs:
>
> $ cat ERR/11617-9
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
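>
> (For reference, the 126:0 above is what sacct reports; a minimal query of
> this kind, using standard format fields, shows it per array task:
>
> $ sacct -j 11617 --format=JobID,State,ExitCode
> )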
>
> Note that I have enough resources to run more than 50 jobs at the same time...
>
> If I resubmit my script while forcing Slurm to run only 10 jobs at a time
> (--array=1-100%10), all jobs succeed. But if I allow 30 jobs at a time
> (--array=1-100%30), some of them fail again.
>
> Has anyone ever faced this type of problem? If so, please kindly enlighten
> me.
>
> Regards
>
> Jean-Mathieu Chantrein
> In charge of the LERIA computing center
> University of Angers
>
> __________________
> hello_array.slurm
>
> #!/bin/bash
> # hello_array.slurm
> #SBATCH --job-name=hello
> #SBATCH --output=OUT/%A-%a
> #SBATCH --error=ERR/%A-%a
> #SBATCH --partition=std
> #SBATCH --array=1-100%10
> ./hello $SLURM_ARRAY_TASK_ID
>
> ________________
> main.cpp
>
> #include <iostream>
> #include <unistd.h>
>
> int main(int argc, char** argv) {
>     usleep(10000000); // sleep for 10 seconds
>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>     return 0;
> }
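>
> (For completeness, I build and run it as follows; a minimal sketch,
> assuming a standard g++ toolchain and the binary name "hello" used in the
> script above:
>
> $ g++ -O2 -o hello main.cpp
> $ ./hello 1
> Hello world! job array number 1
> )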
>
>
>
>
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
> Jeffrey T. Frey, Ph.D.
> Systems Programmer V / HPC Management
> Network & Systems Services / College of Engineering
> University of Delaware, Newark DE 19716
> Office: (302) 831-6034 Mobile: (302) 419-4976
> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
>