<html dir="ltr"><head></head><body style="text-align:left; direction:ltr;"><div>Greetings everyone,</div><div><br></div><div>I have an issue with jobs I'm submitting, and I have no idea how to solve it.</div><div><br></div><div>I submit the following script using sbatch:</div><div><br></div><div>#!/bin/bash</div><div>#SBATCH -t 1-00:00:00</div><div>#SBATCH --mem 8000 </div><div>#SBATCH -n 512</div><div>#SBATCH -p all # partition</div><div>#SBATCH -J 1day8GB # job name</div><div><br></div><div><br></div><div>ulimit -s unlimited</div><div><br></div><div>module purge</div><div>module load gcc63/netcdf/4.6.1</div><div>module load gcc63/netcdf-fortran/4.4.4</div><div><br></div><div>{ sleep 30 && while killall -q -0 pschism_LNEC_WS_GNU_VL_HA; do ps fux >> psFux.out; sleep 10; done; } &</div><div><br></div><div>mpirun pschism_LNEC_WS_GNU_VL_HA</div><div></div><div><br></div><div>Every time I submit it, the job runs for exactly 1 hour, 5 minutes and 11 seconds, and then the slurm+++.out file gives me the following output:</div><div><br></div><div>--------------------------------------------------------------------------</div><div>Primary job  terminated normally, but 1 process returned</div><div>a non-zero exit code. 
Per user-direction, the job has been aborted.</div><div>--------------------------------------------------------------------------</div><div>--------------------------------------------------------------------------</div><div>mpirun noticed that process rank 421 with PID 74958 on node wn058 exited on signal 11 (Segmentation fault).</div><div>--------------------------------------------------------------------------</div><div><br></div><div>When I look at the slurmd.log on the node specified in the error message, I see the following:</div><div><br></div><div>[2019-09-12T10:19:32.693] launch task 3684.0 request from UID:4000005 GID:4000001 HOST:x.x.x.22 PORT:44211</div><div>[2019-09-12T10:19:32.695] _run_prolog: run job script took usec=15</div><div>[2019-09-12T10:19:32.695] _run_prolog: prolog with lock for job 3684 ran for 0 seconds</div><div>[2019-09-12T11:24:37.306] [3684.0] done with job</div><div><br></div><div>This issue is affecting a lot of users, and I no longer know where to look.</div><div>I tried submitting several jobs using Open MPI to see if that was the issue, and they all finished OK.</div><div>Does anyone have any idea how to debug this issue?</div><div></div><div><br></div><div><span><div>Cumprimentos / Best Regards,</div><div>Zacarias Benta</div><div>INCD @ LIP - UMinho</div></span></div></body></html>