<div dir="ltr"><div><div><div><div><div><div>Hi All,<br><br></div>I am having an issue with jobs that end, whether by an "scancel", by being killed when the job wall time limit is reached, or even (in an interactive shell started with srun --pty) by exiting the shell.  Here is an excerpt from /var/log/slurmd on a node where a typical job was running:<br><br>[2018-03-05T12:48:49.165] _run_prolog: run job script took usec=6160<br>[2018-03-05T12:48:49.165] _run_prolog: prolog with lock for job 523 ran for 0 seconds<br>[2018-03-05T12:48:49.454] launch task 523.0 request from 31866.3048@192.168.246.11 (port 7405)<br>[2018-03-05T12:48:49.486] [523.0] in _window_manager<br>[2018-03-05T13:48:52.488] [523.0] error: *** STEP 523.0 ON vap0849 CANCELLED AT 2018-03-05T13:48:52 DUE TO TIME LIMIT ***<br>[2018-03-05T13:50:23.000] [523.0] error: *** STEP 523.0 STEPD TERMINATED ON vap0849 AT 2018-03-05T13:50:22 DUE TO JOB NOT ENDING WITH SIGNALS ***<br>[2018-03-05T13:50:23.000] [523.extern] error: *** EXTERN STEP FOR 523 STEPD TERMINATED ON vap0849 AT 2018-03-05T13:50:22 DUE TO JOB NOT ENDING WITH SIGNALS ***<br>[2018-03-05T13:50:23.000] [523.extern] done with job<br>[2018-03-05T13:50:23.000] [523.0] done with job<br><br><br></div>The node that the job was on hangs (does not schedule new jobs), while the job state shows "completing" in squeue. The job eventually "times out" and an error is reported in the slurmd log. Worse, with Slurm 17.02.2, this always caused the node to go into a Draining state. Since upgrading to 17.11.2, the error still occurs, but *usually* nodes don't go into a drained state (there is some evidence that it still happens occasionally, however). <br><br></div>The issue looks similar to <a href="https://bugs.schedmd.com/show_bug.cgi?id=3941">https://bugs.schedmd.com/show_bug.cgi?id=3941</a>, where the recommendation was to upgrade. As mentioned, the upgrade seems to usually (but maybe not always) prevent nodes from going into the DRAIN state. 
But the real question is: what is causing the "job not ending with signals" error? Are there examples of what should go into an "UnkillableStepProgram", if that's the solution? Slurm should be sending, e.g., SIGTERM, and then SIGKILL if needed.<br><br><br></div>Here are some parameters that may be relevant:<br>[root@vap0843 slurm]# scontrol show config | grep -i kill<br>KillOnBadExit           = 0<br>KillWait                = 30 sec<br>UnkillableStepProgram   = (null)<br>UnkillableStepTimeout   = 60 sec<br>[root@vap0843 slurm]# scontrol show config | grep -i epilog<br>Epilog                  = (null)<br>EpilogMsgTime           = 2000 usec<br>EpilogSlurmctld         = (null)<br>PrologEpilogTimeout     = 65534<br>ResvEpilog              = (null)<br>SrunEpilog              = (null)<br>TaskEpilog              = (null)<br>[root@vap0843 slurm]# scontrol show config | grep -i cgroup<br>JobAcctGatherType       = jobacct_gather/cgroup<br>ProctrackType           = proctrack/cgroup<br>TaskPlugin              = task/cgroup<br><br></div>Many Thanks,<br></div>  Keith<br></div>
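P.S. To make the UnkillableStepProgram question concrete, here is a minimal sketch of the kind of diagnostic I have in mind. It rests on an assumption that I have not confirmed on these nodes: that the processes ignoring SIGKILL are stuck in uninterruptible sleep (state D in ps), for example waiting on a hung filesystem. The function names are hypothetical, and the ps options assume the Linux procps version of ps.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): reads `ps` output on stdin with
# columns PID STAT WCHAN COMMAND... and prints only the rows whose state
# starts with D (uninterruptible sleep). The header row is dropped because
# its second field is "STAT", which does not start with D.
filter_unkillable() {
    awk '$2 ~ /^D/ { print }'
}

# Hypothetical wrapper: list every D-state process on the node together
# with the kernel function it is waiting in (WCHAN), assuming procps ps.
report_unkillable() {
    ps -eo pid,stat,wchan:32,args | filter_unkillable
}
```

A script that UnkillableStepProgram points to could call report_unkillable and append the result to a log file of the admin's choosing, so that the next occurrence records which processes were stuck and where in the kernel they were waiting.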