[slurm-users] Mysterious job terminations on Slurm 17.11.10
Andy Riebs
andy.riebs at hpe.com
Thu Jan 31 19:04:45 UTC 2019
Hi All,
Just checking to see if this sounds familiar to anyone.
Environment:
- CentOS 7.5 x86_64
- Slurm 17.11.10 (but this also happened with 17.11.5)
We typically run about 100 tests/night, selected from a handful of
favorites. For roughly 1 in 300 test runs, we see one of two mysterious
failures:
1. The 5-minute cancellation
A job will be rolling along, generating its expected output, and then
this message appears:
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
sacct reports
JobID        Start               End                 ExitCode State
------------ ------------------- ------------------- -------- ----------
3418         2019-01-29T05:54:07 2019-01-29T05:59:16 0:9      FAILED
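(For reference, that output comes from an sacct query along these lines; the
exact flags in our wrapper scripts may differ slightly:)

    sacct -j 3418 --format=JobID,Start,End,ExitCode,State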
When they happen, these failures consistently strike at just about 5
minutes into the run.
2. The random cancellation
As above, a job will be generating the expected output, and then we see
srun: forcing job termination
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 3531.0 ON nodename CANCELLED AT 2019-01-30T07:35:50 ***
srun: error: nodename: task 250: Terminated
srun: Terminating job step 3531.0
But this time, sacct reports
JobID        Start               End                 ExitCode State
------------ ------------------- ------------------- -------- ----------
3531         2019-01-30T07:21:25 2019-01-30T07:35:50 0:0      COMPLETED
3531.0       2019-01-30T07:21:27 2019-01-30T07:35:56 0:15     CANCELLED
I think we've seen these cancellations pop up anywhere from a minute or
two into the test run to perhaps 20 minutes in.
The only thing slightly unusual in our job submissions is that we use
srun's "--immediate=120" so that the scripts can respond appropriately
if a node goes down.
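Roughly speaking, each test is launched like this (the node/task counts
and test binary are placeholders, not our real values):

    # Launch the test; srun gives up if resources aren't allocated within 120 s
    srun --immediate=120 -N <nodes> -n <tasks> ./test_binary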
With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in
the slurmctld or slurmd logs.
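(That is, with the following logging levels set in slurm.conf for these runs:)

    # slurm.conf logging levels while chasing this
    SlurmctldDebug=debug2
    SlurmdDebug=debug5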
Any thoughts on what might be happening, or what I might try next?
Andy
--
Andy Riebs
andy.riebs at hpe.com
Hewlett-Packard Enterprise
High Performance Computing Software Engineering
+1 404 648 9024
My opinions are not necessarily those of HPE
May the source be with you!