[slurm-users] Jobs stop after 1:05:11 with segmentation fault

Zacarias Benta zacarias at lip.pt
Thu Sep 12 10:49:19 UTC 2019


Greetings everyone,

I have an issue with the jobs I am submitting, and I have no idea how
to solve it.

I submit the following script using sbatch:

#!/bin/bash
#SBATCH -t 1-00:00:00
#SBATCH --mem 8000 
#SBATCH -n 512
#SBATCH -p all # partition
#SBATCH -J 1day8GB # job name


ulimit -s unlimited

module purge
module load gcc63/netcdf/4.6.1
module load gcc63/netcdf-fortran/4.4.4

{ sleep 30 && while killall -q -0 pschism_LNEC_WS_GNU_VL_HA; do ps fux >> psFux.out; sleep 10; done; } &

mpirun pschism_LNEC_WS_GNU_VL_HA


Every time I submit it, the job runs for exactly 1 hour, 5 minutes and
11 seconds, and then the slurm+++.out file contains the following output:

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 421 with PID 74958 on node wn058
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

When I look at the slurmd.log on the node named in the error message,
I see the following:

[2019-09-12T10:19:32.693] launch task 3684.0 request from UID:4000005 GID:4000001 HOST:x.x.x.22 PORT:44211
[2019-09-12T10:19:32.695] _run_prolog: run job script took usec=15
[2019-09-12T10:19:32.695] _run_prolog: prolog with lock for job 3684 ran for 0 seconds
[2019-09-12T11:24:37.306] [3684.0] done with job
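
To cross-check the run time against the log, the elapsed time between
the "launch task" and "done with job" timestamps above can be computed
with a small sketch like the one below. It assumes GNU date is
available, and elapsed_between is a hypothetical helper name of my own,
not part of Slurm. Fractional seconds are simply dropped.

```bash
#!/bin/bash
# Sketch: elapsed time between two slurmd.log timestamps.
# Assumes GNU date; elapsed_between is a hypothetical helper, not a Slurm tool.
elapsed_between() {
    local start_ts end_ts start end secs
    # Strip the fractional seconds (".693") from each timestamp.
    start_ts=${1%.*}
    end_ts=${2%.*}
    # Replace the "T" separator with a space so date -d accepts the string.
    start=$(date -u -d "${start_ts/T/ }" +%s)
    end=$(date -u -d "${end_ts/T/ }" +%s)
    secs=$(( end - start ))
    # Print as H:MM:SS.
    printf '%d:%02d:%02d\n' $(( secs / 3600 )) $(( secs % 3600 / 60 )) $(( secs % 60 ))
}

elapsed_between "2019-09-12T10:19:32.693" "2019-09-12T11:24:37.306"
```

For the two timestamps in the log this comes out at about 1 hour and
5 minutes for the job step, in line with the run time I see reported.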

This issue is affecting a lot of users, and I no longer know where to
look.
I tried submitting several jobs using openmpi to see if that was the
issue, and they all finished OK.
Does anyone have any idea how to debug this issue?



Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho
  

     	