[slurm-users] new user; ExitCode reporting
Chris Samuel
chris at csamuel.org
Fri Nov 23 05:09:58 MST 2018
On Friday, 23 November 2018 10:21:09 PM AEDT Matthew Goulden wrote:
> I've spent some time reading through the (excellent, frankly) documentation
> for sbatch and job_exit_code and while learning a great deal nothing has
> explained this anomaly.
I suspect Slurm is trying to be helpful: an exit code above 128 usually means
the process was terminated by a signal, with the shell reporting 128 + N for
signal N, so sacct subtracts 128 from exit values of 128 or more. The bash
manual page says:
    The return value of a simple command is its exit status, or 128+n if
    the command is terminated by signal n.
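As a quick illustration of that rule, here is a minimal sketch that should run in any POSIX shell. It uses SIGKILL (signal 9) only because that signal cannot be ignored or trapped, so the result does not depend on how the calling environment has set up signal handling; the same arithmetic applies to signal 13.

```shell
# Start a child shell that terminates itself with SIGKILL (signal 9).
# The "|| var=$?" form captures the status even if errexit is enabled.
killed_status=0
sh -c 'kill -s KILL $$' || killed_status=$?

# Start a child shell that exits normally with the same number.
plain_status=0
sh -c 'exit 137' || plain_status=$?

# From $? alone the parent cannot tell these two cases apart.
echo "killed by signal 9: $killed_status"
echo "plain exit 137:     $plain_status"
```

Both lines report 137 (128 + 9), which is why a tool that only sees the number has to guess whether a signal was involved.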
This is what sacct does (it appears the right value is in the DB):
    if (exit_code != NO_VAL) {
        if (WIFSIGNALED(exit_code))
            tmp_int2 = WTERMSIG(exit_code);
        else if (WIFEXITED(exit_code))
            tmp_int = WEXITSTATUS(exit_code);
        if (tmp_int >= 128)
            tmp_int -= 128;
    }
In your case, 141 = 128 + 13, so sacct reports 13.
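To make the arithmetic concrete, here is a rough shell sketch of the same conversion. The function name describe_exit_code is made up for illustration and is not part of Slurm; it only mimics the "subtract 128 when the value is 128 or more" step and asks the shell's own kill -l for the signal name.

```shell
# describe_exit_code: hypothetical helper, not part of Slurm.
# Shows the raw exit value, the value after the sacct-style subtraction,
# and the signal name the shell associates with that exit status.
describe_exit_code() {
    code=$1
    if [ "$code" -ge 128 ]; then
        shown=$((code - 128))
        # kill -l accepts an exit status and prints the matching signal name;
        # fall back to "unknown" when there is no such signal.
        signame=$(kill -l "$code" 2>/dev/null) || signame=unknown
        echo "raw=$code shown=$shown signal=$signame"
    else
        echo "raw=$code shown=$code signal=none"
    fi
}

describe_exit_code 141
describe_exit_code 1
```

On Linux, signal 13 is SIGPIPE, so the first call reports shown=13 and signal=PIPE, while the second is passed through unchanged.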
*If* your job uses srun you can ask Slurm for the DerivedExitCode. That is the
highest exit code from all the srun invocations in the job, but it will be the
number you expect because sacct does not convert it.
$ sbatch --wrap 'srun bash -c "exit 141"'
Submitted batch job 1795583
$ sacct -j 1795583
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1795583            wrap    skylake   hpcadmin          1     FAILED     13:0
1795583.bat+      batch              hpcadmin          1     FAILED     13:0
1795583.ext+     extern              hpcadmin          1  COMPLETED      0:0
1795583.0          bash              hpcadmin          1     FAILED     13:0
$ sacct -j 1795583 -o jobid,jobname,state,derivedexitcode -X
       JobID    JobName      State DerivedExitCode
------------ ---------- ---------- ---------------
1795583            wrap     FAILED           141:0
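If you want to use that value in a script, something along these lines might do. This is only a sketch: split_exit_code is a made-up helper name, and it assumes the field always has the "code:signal" shape shown above.

```shell
# split_exit_code: hypothetical helper, not part of Slurm.
# Splits a "code:signal" field, as printed by sacct, into its two numbers.
split_exit_code() {
    field=$1
    code=${field%%:*}     # text before the first colon
    signal=${field#*:}    # text after the first colon
    echo "code=$code signal=$signal"
}

split_exit_code "141:0"

# On a cluster the field could be fetched without header or padding, e.g.:
#   split_exit_code "$(sacct -j 1795583 -X -n -P -o derivedexitcode)"
```

The commented sacct call uses -n (no header) and -P (parsable output) so that only the bare field reaches the helper.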
Hope that helps!
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC