[slurm-users] Array job execution trouble: some jobs in the array fail

Jean-mathieu CHANTREIN jean-mathieu.chantrein at univ-angers.fr
Fri Jan 11 15:10:32 UTC 2019


Don't you put any limits on your master nodes?

I will answer my own question: I only had to set the PropagateResourceLimits parameter in slurm.conf to NONE. This is not a problem, since I enable cgroups directly on each of the compute nodes. 
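
For reference, here is roughly what the relevant settings look like (the cgroup.conf lines are only a sketch of a typical cgroup setup, not necessarily my exact configuration):

# slurm.conf (controller and compute nodes)
PropagateResourceLimits=NONE
TaskPlugin=task/cgroup
ProctrackType=proctrack/cgroup

# cgroup.conf (compute nodes) -- typical constraints, adjust as needed
ConstrainCores=yes
ConstrainRAMSpace=yes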

Regards. 

Jean-Mathieu 

> De: "Jean-Mathieu Chantrein" <jean-mathieu.chantrein at univ-angers.fr>
> À: "Slurm User Community List" <slurm-users at lists.schedmd.com>
> Envoyé: Vendredi 11 Janvier 2019 15:55:35
> Objet: Re: [slurm-users] Array job execution trouble: some jobs in the array
> fail

> Hello Jeffrey.

> That's exactly it. Thank you very much, I would not have thought of that. I
> had in fact set a limit of 20 processes (nproc) in /etc/security/limits.conf
> to prevent potential misuse by some users. I had not imagined for a second
> that it could propagate to the compute nodes!
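
> Concretely, it was an entry of roughly this form in /etc/security/limits.conf
> on the master node (the domain field shown here is only illustrative):
>
> # <domain>  <type>  <item>   <value>
> *           hard    nproc    20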

> Don't you put any limits on your master nodes?

> In any case, your help has been particularly useful to me. Thanks a lot again.

> Best regards.

> Jean-Mathieu

>> De: "Jeffrey Frey" <frey at udel.edu>
>> À: "Slurm User Community List" <slurm-users at lists.schedmd.com>
>> Envoyé: Vendredi 11 Janvier 2019 15:27:13
>> Objet: Re: [slurm-users] Array job execution trouble: some jobs in the array
>> fail

>> What does ulimit tell you on the compute node(s) where the jobs are running? The
>> error message you cited arises when a user has reached the per-user process
>> count limit (e.g. "ulimit -u"). If your Slurm config doesn't limit how many
>> jobs a node can execute concurrently (e.g. oversubscribe), then:

>>> - no matter what you have a race condition here (when/if the process limit is
>>> reached)

>>> - the behavior is skewed toward happening more quickly/easily when your job
>>> actually lasts a non-trivial amount of time (e.g. by adding the usleep()).

>> It's likely you have stringent limits on your head/login node that are getting
>> propagated to the compute environment (see PropagateResourceLimits in the
>> slurm.conf documentation). By default Slurm propagates all ulimits set in your
>> submission shell.

>> E.g.

>>> [frey at login00 ~]$ srun ... --propagate=NONE /bin/bash
>>> [frey at login00 ~]$ hostname
>>> r00n56.localdomain.hpc.udel.edu
>>> [frey at login00 ~]$ ulimit -u
>>> 4096
>>> [frey at login00 ~]$ exit
>>> :
>>> [frey at login00 ~]$ ulimit -u 24
>>> [frey at login00 ~]$ srun ... --propagate=ALL /bin/bash
>>> [frey at login00 ~]$ hostname
>>> r00n49.localdomain.hpc.udel.edu
>>> [frey at login00 ~]$ ulimit -u
>>> 24
>>> [frey at login00 ~]$ exit
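
>> As an alternative to changing slurm.conf cluster-wide, propagation can also be
>> disabled per job from the submission script itself; adding --propagate to your
>> batch script along these lines should work (a sketch, untested on your setup):

>>> #!/bin/bash
>>> #SBATCH --job-name=hello
>>> #SBATCH --array=1-100%10
>>> # don't copy the login node's ulimits to the job:
>>> #SBATCH --propagate=NONE
>>> ./hello $SLURM_ARRAY_TASK_ID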

>>> On Jan 11, 2019, at 4:51 AM, Jean-mathieu CHANTREIN
>>> <jean-mathieu.chantrein at univ-angers.fr> wrote:

>>> Hello,

>>> I'm new to Slurm (I used SGE before) and I'm new to this list. I'm having some
>>> difficulties with Slurm's job arrays; maybe you can help me?

>>> I am working with Slurm version 17.11.7 on Debian testing. I use slurmdbd and
>>> fairshare.

>>> For my current user, I have the following limitations:
>>> Fairshare = 99
>>> MaxJobs = 50
>>> MaxSubmitJobs = 100

>>> I wrote a little hello_world program in C++ to do some tests, and a 100-job
>>> hello_world array job works properly.
>>> If I take the same program but add a usleep of 10 seconds (to observe the
>>> behavior with squeue and simulate a slightly longer program), part of my jobs
>>> fail (FAILED) with error 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the
>>> slurm log). The number of failing jobs varies between runs. Here is the error
>>> output of one of these jobs:

>>> $ cat ERR/11617-9
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily
>>> unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily
>>> unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily
>>> unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily
>>> unavailable
>>> /var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable

>>> Note that I have enough resources to run more than 50 jobs at the same time ...

>>> If I resubmit while forcing Slurm to run only 10 jobs at a time
>>> (--array=1-100%10), all jobs succeed. But if I force Slurm to run only 30 jobs
>>> at a time (--array=1-100%30), some of them fail again.

>>> Has anyone ever faced this type of problem? If so, please kindly enlighten me.

>>> Regards

>>> Jean-Mathieu Chantrein
>>> In charge of the LERIA computing center
>>> University of Angers

>>> __________________
>>> hello_array.slurm

>>> #!/bin/bash
>>> # hello_array.slurm
>>> #SBATCH --job-name=hello
>>> #SBATCH --output=OUT/%A-%a
>>> #SBATCH --error=ERR/%A-%a
>>> #SBATCH --partition=std
>>> #SBATCH --array=1-100%10
>>> ./hello $SLURM_ARRAY_TASK_ID

>>> ________________
>>> main.cpp

>>> #include <iostream>
>>> #include <unistd.h>

>>> // sleep 10 s, then print the array task index passed as argv[1]
>>> int main(int argc, char** argv) {
>>>     usleep(10000000); // 10 seconds
>>>     std::cout << "Hello world! job array number " << argv[1] << std::endl;
>>>     return 0;
>>> }
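
>>> For completeness, the hello binary used by hello_array.slurm is built with
>>> something along the lines of:
>>>
>>> g++ -O2 -o hello main.cpp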

>> ::::::::::::::::::::::::::::::::::::::::::::::::::::::
>> Jeffrey T. Frey, Ph.D.
>> Systems Programmer V / HPC Management
>> Network & Systems Services / College of Engineering
>> University of Delaware, Newark DE 19716
>> Office: (302) 831-6034 Mobile: (302) 419-4976
>> ::::::::::::::::::::::::::::::::::::::::::::::::::::::