[slurm-users] MPI job termination

Reuti reuti at staff.uni-marburg.de
Sun Apr 7 18:05:18 UTC 2019


> On 07.04.2019 at 19:15, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
> 
> Hi,
> A multinode MPI job terminated with the following messages in the log file
> 
> =------------------------------------------------------------------------------=
>    JOB DONE.
> =------------------------------------------------------------------------------=
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> STOP 2
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> STOP 2
> STOP 2
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus causing
> the job to be terminated. The first process to do so was:
> 
>   Process name: [[9801,1],8]
>   Exit code:    2
> ------------------------------------
> 
> 
> Although it says the job is done, I would like to know whether it terminated abnormally.
> Moreover, I cannot figure out whether there is a problem with the input files. For example, maybe the calculations diverged, but this error message does not make that clear.
> Any idea?

This seems to be unrelated to SLURM.

I assume you are using Open MPI. In Open MPI *all* processes must exit with an exit code of zero; otherwise an error in the application is assumed, even if MPI_Finalize() was called beforehand rather than MPI_ABORT(). This is of course open to discussion: at least rank zero should be able to return a non-zero exit code to the calling script (IMO). I suggest raising this question on the Open MPI mailing list.

I don't know what the MPI standard says about it, but with Intel MPI it's different: an exit after MPI_Finalize() is treated as a normal program termination. The highest value returned by any of the processes will be returned to the job script and no application error is raised. Hence one can act on this return code in a proper way in the job script.
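To illustrate what "acting on the return code" could look like, here is a minimal sketch in Python. It only assumes that the launcher hands an exit code back to the caller. The helper names run_step() and describe() are made up for this example, and the meaning attached to exit code 2 is an assumed convention of the application (matching the "Exit code: 2" in the log above), not something MPI or SLURM defines.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and return its exit code without raising on failure."""
    return subprocess.run(cmd).returncode


def describe(exit_code):
    """Map an exit code to a follow-up action (hypothetical convention)."""
    if exit_code == 0:
        return "success: continue with post-processing"
    if exit_code == 2:
        # Assumption: the application uses 2 to flag a condition that is
        # worth inspecting, not necessarily a crash.
        return "application returned 2: inspect the output before continuing"
    return f"failure (exit code {exit_code}): stop the workflow"


# Stand-in for the real launcher line, which would be something like
# ["mpirun", "./app"] (hypothetical); here a child Python process exits with 2.
demo_cmd = [sys.executable, "-c", "import sys; sys.exit(2)"]
print(describe(run_step(demo_cmd)))
```

In a real job script the same branching would sit directly after the launcher line. Whether the script ever sees the application's own code, or only a generic error from the launcher, depends on the MPI implementation as described above.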

-- Reuti

