[slurm-users] srun timeout exit status bug?

Dan Boorstein dan at boorstein.net
Wed May 2 13:56:45 MDT 2018


Hi All,

I've encountered what I think is a bug in srun's exit status when a
timeout occurs, but perhaps my expectation is off. I expect srun to exit
with a non-zero status whenever the time limit is reached before all tasks
have completed.

This behaves as expected when all tasks time out. Here the limit is 1
minute, and the tasks sleep for 120 and 240 seconds:

> srun --time 1 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"

    srun: Force Terminated job 2392836
    srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
    slurmstepd: error: *** STEP 2392836.0 ON foo0205 CANCELLED AT 2018-04-19T18:33:34 DUE TO TIME LIMIT ***
    srun: error: foo0205: tasks 0-1: Terminated
    status: 143
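
For reference, the 143 above is consistent with the usual shell convention
of reporting 128 plus the signal number for a signal-terminated process,
SIGTERM being signal 15. A minimal sketch that reproduces the same status
without Slurm (this only illustrates the convention; it says nothing about
how srun computes its status internally):

```shell
# A child shell that terminates itself with SIGTERM (signal 15).
status=0
sh -c 'kill -TERM $$' || status=$?

# The parent shell reports 128 + 15 for a child killed by SIGTERM.
echo "status: $status"
```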

However, when some tasks complete and others time out, srun always exits
with a zero status. With a 3-minute limit, task 0 (120 seconds) finishes
but task 1 (240 seconds) is killed. This is not what I expect, since a task
was forcefully terminated:

> srun --time 3 --ntasks=2 perl -e 'sleep 120 + 120 * $ENV{SLURM_PROCID}'; echo "status: $?"

    srun: Force Terminated job 2392845
    srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
    slurmstepd: error: *** STEP 2392845.0 ON foo3009 CANCELLED AT 2018-04-19T18:37:04 DUE TO TIME LIMIT ***
    srun: error: foo3009: task 1: Terminated
    status: 0

Is my expectation off, or does this look like a genuine bug?
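
In the meantime, a possible workaround is to stop relying on the exit
status alone and also scan srun's captured stderr for the time-limit
message. The sketch below assumes the "DUE TO TIME LIMIT" line reaches
srun's stderr as in the output above; step_hit_time_limit and srun.err are
hypothetical names of my own, not anything Slurm provides:

```shell
# Hypothetical helper: succeeds (returns 0) if the given log file contains
# the time-limit cancellation message seen in the srun output above.
step_hit_time_limit() {
    grep -q 'DUE TO TIME LIMIT' "$1"
}

# Usage sketch (commented out, since it needs a Slurm cluster):
#   status=0
#   srun --time 3 --ntasks=2 my_command 2> srun.err || status=$?
#   if [ "$status" -eq 0 ] && step_hit_time_limit srun.err; then
#       status=1   # treat a partial timeout as a failure
#   fi
```

This depends on the exact wording of the message, so it is fragile and
only meant as a stopgap.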

Thanks,

  - Dan