[slurm-users] Mysterious job terminations on Slurm 17.11.10

Fri Feb 1 01:45:38 UTC 2019

Perhaps launch the job with srun -vvv to get the most verbose messages from
srun as it runs the job.
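
Something along these lines, as a rough sketch only. The function name, the log path, and the test script name are placeholders I made up, not anything from your setup. srun writes its verbose messages to stderr, so that stream is sent to a per-run log you can keep when a run fails:

```shell
# Sketch: wrap srun so every run leaves a verbose client-side log behind.
# run_verbose is a hypothetical helper name, not part of Slurm.
# Usage: run_verbose LOGFILE [srun options...] COMMAND [args...]
run_verbose() {
    verbose_log=$1
    shift
    # Each extra -v raises srun's verbosity; the messages go to stderr,
    # which is redirected to the log. The job's stdout is left untouched.
    srun -vvv "$@" 2> "$verbose_log"
}
```

For example, `run_verbose "srun-$(date +%Y%m%dT%H%M%S).log" --immediate=120 ./nightly_test.sh` (script name hypothetical). The function returns srun's exit status, so the calling script can keep the log only when the run fails.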

Doug

On Thu, Jan 31, 2019 at 12:07 PM Andy Riebs <andy.riebs at hpe.com> wrote:

> Hi All,
>
> Just checking to see if this sounds familiar to anyone.
>
> Environment:
> - CentOS 7.5 x86_64
> - Slurm 17.11.10 (but this also happened with 17.11.5)
>
> We typically run about 100 tests/night, selected from a handful of
> favorites. For roughly 1 in 300 test runs, we see one of two mysterious
> failures:
>
> 1. The 5 minute cancellation
>
> A job will be rolling along, generating its expected output, and then
> this message appears:
>
> srun: forcing job termination
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
> 2019-01-30T07:35:50 ***
> srun: error: nodename: task 250: Terminated
> srun: Terminating job step 3531.0
>
> sacct reports
>
>        JobID               Start                 End ExitCode      State
> ------------ ------------------- ------------------- -------- ----------
> 3418         2019-01-29T05:54:07 2019-01-29T05:59:16      0:9     FAILED
>
> When these failures occur, they consistently happen at just about 5
> minutes into the run.
>
> 2. The random cancellation
>
> As above, a job will be generating the expected output, and then we see
>
> srun: forcing job termination
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT
> 2019-01-30T07:35:50 ***
> srun: error: nodename: task 250: Terminated
> srun: Terminating job step 3531.0
>
> But this time, sacct reports
>
>        JobID               Start                 End ExitCode      State
> ------------ ------------------- ------------------- -------- ----------
> 3531         2019-01-30T07:21:25 2019-01-30T07:35:50      0:0  COMPLETED
> 3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56     0:15  CANCELLED
>
> I think we've seen these cancellations pop up as soon as a minute or two
> into the test run, up to perhaps 20 minutes into the run.
>
> The only thing slightly unusual in our job submissions is that we use
> srun's "--immediate=120" so that the scripts can respond appropriately if a
> node goes down.
>
> With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in
> the slurmctld or slurmd logs.
>
> Any thoughts on what might be happening, or what I might try next?
>
> Andy
>
> --
> Andy Riebs
> andy.riebs at hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
>     May the source be with you!
>
>