[slurm-users] new user; ExitCode reporting

mercan ahmet.mercan at uhem.itu.edu.tr
Fri Nov 23 04:54:47 MST 2018


Hi;


As far as I know, exit codes 141 and 13 refer to the same thing. A process 
killed by a signal gets exit code 128 + signal number, so 141 = 128 + 13, 
and signal 13 is SIGPIPE:

https://slurm-dev.schedmd.narkive.com/MYGH56EW/job-exit-codes
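
To make the arithmetic concrete, here is a minimal sketch of decoding such an exit code in the shell. The function name signal_from_exit is made up for illustration and is not part of Slurm; it only assumes the usual shell convention that codes above 128 encode a signal number.

```shell
#!/bin/bash
# Hypothetical helper (not a Slurm tool): extract the signal number from a
# shell-style exit code, assuming the "128 + signal" convention.
signal_from_exit() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # Codes above 128 conventionally mean "killed by signal (code - 128)".
        echo $((code - 128))
    else
        # 0 here means "no signal encoded in this exit code".
        echo 0
    fi
}

sig=$(signal_from_exit 141)
# kill -l <number> prints the signal name for that number.
echo "exit code 141 -> signal $sig ($(kill -l "$sig"))"
```

On a typical Linux system signal 13 is named PIPE, which usually points to a command in a pipeline writing to a reader that has already exited.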


Ahmet M.



On 23.11.2018 14:36, Matthew Goulden wrote:
>
> A confirmation re-run yielded the same outcome, but the expected exit 
> code was available using
>
> $ scontrol show job 197
>
>    JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=141:0
>
>
> sacct still reports as before
>
> $ sacct -j 197
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 197          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
> 197.batch         batch                   slt          1     FAILED     13:0
>
>
> Matt
>
> ------------------------------------------------------------------------
> *From:* Matthew Goulden
> *Sent:* Friday, November 23, 2018 11:21 AM
> *To:* slurm-users at lists.schedmd.com
> *Subject:* new user; ExitCode reporting
>
> Hi All,
>
>
> New user migrating from UGE/SGE here. I'm baffled by the ExitCode 
> recorded in slurmdb; not sure if this is a 'new user' issue or a bug, 
> so I'm raising it here first.
>
>
> I'm running simple sbatch scripts; the relevant headers are:
>
> #!/bin/bash
>
> #SBATCH --mail-user <me>@<work>
> #SBATCH --mail-type END
>
> #SBATCH -J T_113491_<redacted>_20150522
>
>
> The sbatch script calls various tools and, last of all, a 
> 'completion_reporter' bash script that reports whether all calls ran to 
> completion.
>
> If they did not, the return code from that script is passed to an exit 
> command in the sbatch script; the expectation is that the return code of 
> the sbatch script in these circumstances is the one from 
> 'completion_reporter'. That return code is 141.
>
>
> GOOD
>
> The emails received have subject line consistent with expectations
>
> 'Slurm Job_id=196 Name=T_113491_<redacted>_20150522 Ended, Run time 
> 00:00:24, FAILED, ExitCode 141'
>
>
> UNEXPECTED
>
> However sacct output is not consistent with expectations...
>
> $ sacct -j 196
>        JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
> ------------ ---------- ---------- ---------- ---------- ---------- --------
> 196          T_113491_+ all_slt_l+        slt          1     FAILED     13:0
> 196.batch         batch                   slt          1     FAILED     13:0
>
>
>
> I've spent some time reading through the (excellent, frankly) 
> documentation for sbatch and job_exit_code and, while I learned a great 
> deal, nothing has explained this anomaly.
>
>
> Incidentally I expected to be able to use scontrol as below; any 
> pointers on the unexpected outcome would be welcome
>
> $ scontrol show step 196.batch
> Job step 196.0 not found
>
>
> We have put a fair bit of work into informatively coding our fail 
> exit_codes so suggestions as to what's going on here would be welcome.
>
>
> Thanks in advance
>
>
> Matt
>


