[slurm-users] new user; ExitCode reporting
mercan
ahmet.mercan at uhem.itu.edu.tr
Fri Nov 23 04:54:47 MST 2018
Hi;
As far as I know, exit codes 141 and 13 describe the same event. The signal
number + 128 gives the exit code, so 141 = 128 + 13, and signal 13 is SIGPIPE:
https://slurm-dev.schedmd.narkive.com/MYGH56EW/job-exit-codes
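To illustrate the arithmetic, here is a minimal bash sketch that decodes a shell-style exit status. The function name decode_exit_status is made up for this example and is not part of Slurm; it only assumes the usual shell convention that a status above 128 means "killed by signal (status - 128)". Whether sacct stores the bare signal number in your setup is something to check locally.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm tool): decode a shell-style exit status.
# Assumption: statuses above 128 follow the "128 + signal number" convention.
decode_exit_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        # `kill -l N` maps a signal number to its name (e.g. 13 -> PIPE on Linux).
        echo "signal ${signum} ($(kill -l "$signum"))"
    else
        echo "exit ${status}"
    fi
}

decode_exit_status 141   # on Linux: signal 13 (PIPE)
decode_exit_status 2     # exit 2
```

If the fail codes in a job script are chosen deliberately, keeping them at or below 128 should avoid them being read as signals under this convention.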
Ahmet M.
On 23.11.2018 14:36, Matthew Goulden wrote:
>
> A confirmation re-run yielded the same outcome, but the expected exit
> code was available using
>
> $ scontrol show job 197
>
> JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=141:0
>
>
> sacct still reports as before
>
> $ sacct -j 197
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 197          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
> 197.batch         batch                   slt          1     FAILED     13:0
>
>
> Matt
>
> ------------------------------------------------------------------------
> *From:* Matthew Goulden
> *Sent:* Friday, November 23, 2018 11:21 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* new user; ExitCode reporting
>
> Hi All,
>
>
> As a new user migrating from UGE/SGE, I'm baffled by the ExitCode
> recorded in slurmdb; I'm not sure whether this is a 'new user' issue or
> a bug, so I'm raising it here first.
>
>
> Running simple sbatch scripts with these relevant headers
>
> #!/bin/bash
>
> #SBATCH --mail-user <me>@<work>
> #SBATCH --mail-type END
>
> #SBATCH -J T_113491_<redacted>_20150522
>
>
> The sbatch script calls various tools and, at the end, a
> 'completion_reporter' bash script that reports whether all calls ran to
> completion.
>
> If not, the return code from that script is passed to an exit command in
> the sbatch script; the expectation is that the sbatch script then exits
> with the return code from 'completion_reporter'. That return code is 141.
>
>
> GOOD
>
> The emails received have a subject line consistent with expectations
>
> 'Slurm Job_id=196 Name=T_113491_<redacted>_20150522 Ended, Run time
> 00:00:24, FAILED, ExitCode 141'
>
>
> UNEXPECTED
>
> However sacct output is not consistent with expectations...
>
> $ sacct -j 196
>
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 196          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
> 196.batch         batch                   slt          1     FAILED     13:0
>
>
>
> I've spent some time reading through the (excellent, frankly)
> documentation for sbatch and job_exit_code and, while I've learned a
> great deal, nothing has explained this anomaly.
>
>
> Incidentally I expected to be able to use scontrol as below; any
> pointers on the unexpected outcome would be welcome
>
> $ scontrol show step 196.batch
> Job step 196.0 not found
>
>
> We have put a fair bit of work into informatively coding our fail
> exit_codes so suggestions as to what's going on here would be welcome.
>
>
> Thanks in advance
>
>
> Matt
>
>
>
>