[slurm-users] srun timeout exit status bug?
Dan Boorstein
dan at boorstein.net
Wed May 2 13:56:45 MDT 2018
Hi All,
I've encountered what I think is a bug with srun's exit status when a
timeout occurs, but perhaps my expectation is off. My expectation is for
srun to have a non-zero exit status when a timeout occurs before all tasks
can complete.
This behaves as expected when all tasks are timed out:
> srun --time 1 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"
srun: Force Terminated job 2392836
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 2392836.0 ON foo0205 CANCELLED AT 2018-04-19T18:33:34 DUE TO TIME LIMIT ***
srun: error: foo0205: tasks 0-1: Terminated
status: 143
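For reference, the 143 here matches the common shell convention of reporting 128 plus the signal number for a process killed by a signal, in this case 15 (SIGTERM). A minimal sketch of decoding such a status follows; describe_status is a hypothetical helper name used only for illustration and is not part of Slurm:

```shell
# Hypothetical helper (not part of Slurm): interpret an exit status using
# the common shell convention that 128 + N means "killed by signal N".
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "terminated by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "success"
    else
        echo "exited with code $status"
    fi
}

describe_status 143   # 143 - 128 = 15, i.e. SIGTERM
```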
However, when some tasks complete while others are timed out, srun always
exits with a zero status. This is not what I expect, since at least one task
was forcefully terminated:
> srun --time 3 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"
srun: Force Terminated job 2392845
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
slurmstepd: error: *** STEP 2392845.0 ON foo3009 CANCELLED AT 2018-04-19T18:37:04 DUE TO TIME LIMIT ***
srun: error: foo3009: task 1: Terminated
status: 0
Is my expectation off, or does this look like a genuine bug?
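In the meantime, one possible workaround is to capture srun's output and treat a zero status as a failure whenever the time-limit message shown above appears in it. This is only a sketch: effective_status and srun.log are hypothetical names, and it assumes the "DUE TO TIME LIMIT" text is printed exactly as in the output above.

```shell
# Hypothetical workaround (not a Slurm feature): given srun's exit status
# and a file holding its captured output, report a non-zero status if the
# output mentions a time-limit cancellation.
effective_status() {
    status=$1
    log=$2
    # Assumption: the message text matches the slurmstepd line quoted above.
    if [ "$status" -eq 0 ] && grep -q 'DUE TO TIME LIMIT' "$log"; then
        echo 1
    else
        echo "$status"
    fi
}
```

It could be used along these lines, capturing both stdout and stderr since I have not checked which stream carries the message:

    srun --time 3 --ntasks=2 ... > srun.log 2>&1; effective_status $? srun.log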
Thanks,
- Dan