[slurm-users] "stepd terminated due to job not ending with signals"

Keith Ball bipcuds at gmail.com
Tue Mar 6 09:13:23 MST 2018


Hi All,

I am having an issue with jobs that end, whether by "scancel", by being
killed at the job's wall time limit, or even by exiting the shell of an
"srun --pty" interactive session. An excerpt from /var/log/slurmd while a
typical job was running:

[2018-03-05T12:48:49.165] _run_prolog: run job script took usec=6160
[2018-03-05T12:48:49.165] _run_prolog: prolog with lock for job 523 ran for
0 seconds
[2018-03-05T12:48:49.454] launch task 523.0 request from
31866.3048 at 192.168.246.11 (port 7405)
[2018-03-05T12:48:49.486] [523.0] in _window_manager
[2018-03-05T13:48:52.488] [523.0] error: *** STEP 523.0 ON vap0849
CANCELLED AT 2018-03-05T13:48:52 DUE TO TIME LIMIT ***
[2018-03-05T13:50:23.000] [523.0] error: *** STEP 523.0 STEPD TERMINATED ON
vap0849 AT 2018-03-05T13:50:22 DUE TO JOB NOT ENDING WITH SIGNALS ***
[2018-03-05T13:50:23.000] [523.extern] error: *** EXTERN STEP FOR 523 STEPD
TERMINATED ON vap0849 AT 2018-03-05T13:50:22 DUE TO JOB NOT ENDING WITH
SIGNALS ***
[2018-03-05T13:50:23.000] [523.extern] done with job
[2018-03-05T13:50:23.000] [523.0] done with job


The node that the job ran on hangs (does not schedule new jobs) while the
job state shows "completing" in squeue. The job eventually "times out" and
an error is reported in slurmd. Worse, with Slurm 17.02.2 this always left
the node in a Draining state. Since upgrading to 17.11.2 the error still
occurs, but *usually* the node no longer goes into a drained state (there
is some evidence that it still happens, however).

The issue looks similar to https://bugs.schedmd.com/show_bug.cgi?id=3941,
where the recommendation was to upgrade. As mentioned, upgrading seems to
usually (but perhaps not always) keep nodes out of the DRAIN state. But the
real question is: what is causing the "job not ending with signals"? Are
there examples of what should go into an "UnkillableStepProgram", if that
is the solution? Slurm should be sending e.g. SIGTERM, and then SIGKILL if
needed.
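For what it's worth, here is a minimal sketch of what I imagine such an
UnkillableStepProgram might log (the script contents are my own guess, not
from any documentation). The usual reason SIGKILL "fails" is a process
stuck in uninterruptible "D" state, e.g. on hung NFS/Lustre I/O, which no
signal can reach until the blocked kernel operation returns:

```shell
# Hypothetical UnkillableStepProgram body (my assumption of what would be
# useful, not an official example). SLURM_JOB_ID is assumed to be set in
# the environment; "unknown" is used if it is not.
# List processes in uninterruptible "D" state, plus the kernel function
# (wchan) they are sleeping in, so we can see what ignored SIGKILL.
report="=== $(date) unkillable step on $(hostname), job=${SLURM_JOB_ID:-unknown}
$(ps -eo pid,stat,wchan:32,comm | awk 'NR==1 || $2 ~ /^D/')"
echo "$report"
```

In practice I suppose one would redirect this to a per-job file somewhere
like /var/log/slurm/ and point UnkillableStepProgram at the script.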


Here are some parameters that may be relevant:
[root at vap0843 slurm]# scontrol show config | grep -i kill
KillOnBadExit           = 0
KillWait                = 30 sec
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
[root at vap0843 slurm]# scontrol show config | grep -i epilog
Epilog                  = (null)
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
PrologEpilogTimeout     = 65534
ResvEpilog              = (null)
SrunEpilog              = (null)
TaskEpilog              = (null)
[root at vap0843 slurm]# scontrol show config | grep -i cgroup
JobAcctGatherType       = jobacct_gather/cgroup
ProctrackType           = proctrack/cgroup
TaskPlugin              = task/cgroup
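If an UnkillableStepProgram turns out to be the answer, I assume the change
would be a slurm.conf fragment along these lines (the script path is
hypothetical, and the larger timeout is just a guess to give slow I/O a
chance to complete):

```
# Hypothetical slurm.conf fragment -- path and timeout are assumptions
UnkillableStepProgram=/etc/slurm/unkillable_step.sh
UnkillableStepTimeout=120
```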

Many Thanks,
  Keith