[slurm-users] Jobs stop after 1:05:11 with segmentation fault
Zacarias Benta
zacarias at lip.pt
Thu Sep 12 10:49:19 UTC 2019
Greetings everyone,
I have an issue with jobs I'm submitting, and I have no idea how to
solve it.
I submit the following script using sbatch:
#!/bin/bash
#SBATCH -t 1-00:00:00
#SBATCH --mem 8000
#SBATCH -n 512
#SBATCH -p all # partition
#SBATCH -J 1day8GB # job name
ulimit -s unlimited
module purge
module load gcc63/netcdf/4.6.1
module load gcc63/netcdf-fortran/4.4.4
{ sleep 30 && while killall -q -0 pschism_LNEC_WS_GNU_VL_HA; do ps fux >> psFux.out; sleep 10; done; } &
mpirun pschism_LNEC_WS_GNU_VL_HA
The job runs for exactly 1 hour, 5 minutes and 11 seconds every time I
submit it, and then the slurm+++.out file gives me the following output:
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 421 with PID 74958 on node wn058
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
When I look at the slurmd.log on the node named in the error message,
I see the following:
[2019-09-12T10:19:32.693] launch task 3684.0 request from UID:4000005 GID:4000001 HOST:x.x.x.22 PORT:44211
[2019-09-12T10:19:32.695] _run_prolog: run job script took usec=15
[2019-09-12T10:19:32.695] _run_prolog: prolog with lock for job 3684 ran for 0 seconds
[2019-09-12T11:24:37.306] [3684.0] done with job
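For reference, the time between the "launch task" and "done with job" entries can be worked out from the two timestamps. This is only a rough sketch: it assumes GNU date (for the -d option), and the variable names are made up for illustration. It measures the step as seen by slurmd on that node, which need not match the job's total run time exactly.

```shell
#!/bin/bash
# Timestamps copied from the slurmd.log excerpt (fractional seconds dropped).
start="2019-09-12T10:19:32"
end="2019-09-12T11:24:37"

# Assumption: GNU date is available; -u keeps both conversions in the same zone.
start_s=$(date -u -d "$start" +%s)
end_s=$(date -u -d "$end" +%s)

# Elapsed seconds, then formatted as H:MM:SS.
elapsed=$(( end_s - start_s ))
printf '%d:%02d:%02d\n' $(( elapsed / 3600 )) $(( elapsed % 3600 / 60 )) $(( elapsed % 60 ))
```

The two log entries are about 1 hour and 5 minutes apart, which is in line with the run time I see on every submission.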
This issue is happening to a lot of users and I have no idea where
else to look.
I tried submitting several jobs using openmpi to see if that was the
issue, and they all finished OK.
Does anyone have any idea how to debug this issue?
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - UMinho