[slurm-users] Array job execution trouble: some jobs in the array fail

Jean-mathieu CHANTREIN jean-mathieu.chantrein at univ-angers.fr
Fri Jan 11 09:51:43 UTC 2019


Hello, 

I'm new to Slurm (I used SGE before) and new to this list. I'm having some difficulty with Slurm array jobs; maybe you can help me?

I am running Slurm version 17.11.7 on Debian testing. I use slurmdbd and fairshare.

For my current user, I have the following limits:
Fairshare = 99 
MaxJobs = 50 
MaxSubmitJobs = 100 

I wrote a small C++ hello_world program to run some tests, and a 100-task hello_world array job works properly.
If I take the same program and add a 10-second usleep (to watch the behavior with squeue and simulate a slightly longer program), some of the array tasks fail (FAILED) with exit code 126:0 (output of sacct -j) and WEXITSTATUS 254 (in the Slurm log). The proportion of failed tasks varies from run to run. Here is the error output of one of these tasks:

$ cat ERR/11617-9 
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable 
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable 
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable 
/var/slurm/slurmd/job11626/slurm_script: fork: retry: Resource temporarily unavailable 
/var/slurm/slurmd/job11626/slurm_script: fork: Resource temporarily unavailable 

Note that I have enough resources to run more than 50 jobs at the same time.

If I rerun my submission script while limiting Slurm to 10 simultaneous tasks (--array=1-100%10), all tasks succeed. But if I allow 30 simultaneous tasks (--array=1-100%30), some of them fail again.
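
One way to gather more information: the "fork: Resource temporarily unavailable" message is what bash prints when fork() returns EAGAIN, which can happen when a per-user process limit is reached. Whether that is the cause here is only an assumption. The sketch below uses a hypothetical helper, report_proc_limits (not part of my scripts above), that could be called at the top of the batch script to log the limits each task actually sees on its node:

```shell
#!/bin/bash
# Hypothetical diagnostic helper (an assumption, not part of the original scripts):
# prints the per-user process limits and the number of processes currently
# owned by this user on the node where the task runs.
report_proc_limits() {
    local soft hard nproc
    soft=$(ulimit -S -u)   # soft "max user processes" limit (RLIMIT_NPROC)
    hard=$(ulimit -H -u)   # hard "max user processes" limit
    # Count this user's processes; threads also count against RLIMIT_NPROC,
    # so this number is a lower bound on what the kernel counts.
    nproc=$(ps -u "$(id -u)" -o pid= 2>/dev/null | wc -l)
    echo "host=$(uname -n) soft_nproc=${soft} hard_nproc=${hard} user_procs=${nproc}"
}

report_proc_limits
```

If soft_nproc turns out to be small compared with the number of tasks running at once on one node, that would point toward the limit inherited by the job rather than toward CPU or memory.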

Has anyone ever faced this type of problem? If so, please kindly enlighten me. 

Regards 

Jean-Mathieu Chantrein 
In charge of the LERIA computing center 
University of Angers 

__________________ 
hello_array.slurm 

#!/bin/bash 
# hello_array.slurm
#SBATCH --job-name=hello 
#SBATCH --output=OUT/%A-%a 
#SBATCH --error=ERR/%A-%a 
#SBATCH --partition=std 
#SBATCH --array=1-100%10 
./hello $SLURM_ARRAY_TASK_ID 

________________ 
main.cpp 

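#include <iostream>
#include <unistd.h>

int main(int argc, char** argv) {
    // Sleep for 10 seconds to simulate a longer-running task.
    // sleep() is used because usleep() is not guaranteed to accept values >= 1,000,000 us.
    sleep(10);
    // argv[1] is the array task ID passed by the submission script.
    std::cout << "Hello world! job array number " << (argc > 1 ? argv[1] : "(none)") << std::endl;
    return 0;
}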
