[slurm-users] Array job execution trouble: some jobs in the array fail
Jean-mathieu CHANTREIN
jean-mathieu.chantrein at univ-angers.fr
Fri Jan 11 09:51:43 UTC 2019
Hello,
I'm new to Slurm (I used SGE before) and new to this list. I am having some difficulty with Slurm array jobs; maybe you can help me?
I am running Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.
For my current user, I have the following limits:
Fairshare = 99
MaxJobs = 50
MaxSubmitJobs = 100
I wrote a small C++ hello_world program to run some tests, and a 100-task hello_world array job works properly.
If I take the same program and add a usleep of 10 seconds (to observe the behavior with squeue and to simulate a slightly longer program), some of the tasks fail (state FAILED) with exit code 126:0 (in the output of sacct -j) and WEXITSTATUS 254 (in the Slurm log). The proportion of failed tasks varies from one run to the next. Here is the error output of one of these jobs:
$ cat ERR/11617-9
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable
/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable
Note that I have enough resources to run more than 50 jobs at the same time.
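Since fork is what fails here, and fork reports "Resource temporarily unavailable" (EAGAIN) when a limit on the number of processes is reached, one thing that might be worth checking is the limit on user processes that each task actually inherits on the compute node. The following is only a minimal sketch under that assumption; the function name report_nproc_limit is hypothetical, and other limits (for example a cgroup limit) could be involved instead.

```shell
#!/bin/bash
# Hypothetical diagnostic helper, not part of the original submission script.
# Assumption: the limit of interest is the per-user process limit of this shell.

# Print the node name and the maximum number of user processes for this shell.
report_nproc_limit() {
    echo "host=$(uname -n) max_user_processes=$(ulimit -u)"
}

report_nproc_limit
```

Calling this at the top of the batch script, before ./hello, would record the value in each task's output file, so the nodes where tasks fail can be compared with those where they succeed.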
If I resubmit my script and limit Slurm to 10 simultaneously running tasks (--array=1-100%10), all tasks succeed. But if I allow 30 simultaneously running tasks (--array=1-100%30), some of them fail again.
Has anyone ever faced this type of problem? If so, please kindly enlighten me.
Regards
Jean-Mathieu Chantrein
In charge of the LERIA computing center
University of Angers
__________________
hello_array.slurm
#!/bin/bash
# hello.slurm
#SBATCH --job-name=hello
#SBATCH --output=OUT/%A-%a
#SBATCH --error=ERR/%A-%a
#SBATCH --partition=std
#SBATCH --array=1-100%10
./hello $SLURM_ARRAY_TASK_ID
________________
main.cpp
#include <iostream>
#include <unistd.h>
int main(int argc, char** argv) {
    // Sleep for 10 seconds (10,000,000 microseconds) so the task stays visible in squeue.
    usleep(10000000);
    std::cout << "Hello world! job array number " << argv[1] << std::endl;
    return 0;
}