[slurm-users] Resolution! was Re: Mysterious job terminations on Slurm 17.11.10

Andy Riebs andy.riebs at hpe.com
Tue Mar 12 22:12:36 UTC 2019


It appears that we have gotten to the bottom of this problem! We 
discovered that we only seem to see this problem if our overnight test 
script is run with "nohup," as we have been doing for several years. 
Typically, we would see the mysterious cancellations about once every 
other day, or 3-4 times a week. In the week+ since we started using 
"tmux" instead, we haven't seen this problem at all.

On that basis, I'm declaring success!

Many thanks to Doug Meyer and Chris Samuel for jumping in to offer 
suggestions.

Andy

------------------------------------------------------------------------
*From:* Andy Riebs <andy.riebs at hpe.com>
*Sent:* Thursday, January 31, 2019 2:04PM
*To:* Slurm-users <slurm-users at schedmd.com>
*Cc:*
*Subject:* Mysterious job terminations on Slurm 17.11.10
Hi All,

Just checking to see if this sounds familiar to anyone.

Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)

We typically run about 100 tests/night, selected from a handful of 
favorites. For roughly 1 in 300 test runs, we see one of two mysterious 
failures:

1. The 5-minute cancellation

A job will be rolling along, generating its expected output, and then 
this message appears:

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

sacct reports

    JobID        Start               End                 ExitCode State
    ------------ ------------------- ------------------- -------- ----------
    3418         2019-01-29T05:54:07 2019-01-29T05:59:16 0:9      FAILED
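
(Output in that shape comes from sacct with a field selection along these
lines; the exact format string here is approximate:)

    sacct -j 3418 --format=JobID,Start,End,ExitCode,State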

When these failures occur, they consistently strike at just about 5 minutes 
into the run.

2. The random cancellation

As above, a job will be generating the expected output, and then we see

    srun: forcing job termination
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
    srun: error: nodename: task 250: Terminated
    srun: Terminating job step 3531.0

But this time, sacct reports

    JobID        Start               End                 ExitCode State
    ------------ ------------------- ------------------- -------- ----------
    3531         2019-01-30T07:21:25 2019-01-30T07:35:50 0:0      COMPLETED
    3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56 0:15     CANCELLED

I think we've seen these cancellations pop up as early as a minute or two 
into the test run, and as late as perhaps 20 minutes in.

The only thing slightly unusual in our job submissions is that we use 
srun's "--immediate=120" so that the scripts can respond appropriately 
if a node goes down.
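
A typical step launch looks roughly like this; only --immediate=120 reflects 
the real submissions, everything else here is illustrative:

    # Only --immediate=120 is real; the program name is a placeholder.
    srun --immediate=120 ./test_program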

With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in 
the slurmctld or slurmd logs.
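
For reference, those two levels correspond to these slurm.conf lines:

    SlurmctldDebug=debug2
    SlurmdDebug=debug5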

Any thoughts on what might be happening, or what I might try next?

Andy
