<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Slurm is trying to kill the job that exceeded its time
      limit, but the job doesn't die, so Slurm treats this as a
      problem with the node and drains it (the "Kill task failed"
      reason in your log). Increasing the value of GraceTime or
      KillWait might help:</p>
    <p><br>
    </p>
    <p>
      <blockquote type="cite"><dt><b>GraceTime</b></dt>
        <dd>
          Specifies, in units of seconds, the preemption grace time
          to be extended to a job which has been selected for
          preemption.
          The default value is zero, no preemption grace time is allowed
          on
          this partition.
          Once a job has been selected for preemption, its end time is
          set to the current
          time plus GraceTime. The job's tasks are immediately sent
          SIGCONT and SIGTERM
          signals in order to provide notification of its imminent
          termination.
          This is followed by the SIGCONT, SIGTERM and SIGKILL signal
          sequence upon
          reaching its new end time. This second set of signals is sent
          to both the
          tasks <b>and</b> the containing batch script, if applicable.
          Meaningful only for PreemptMode=CANCEL.
          See also the global <b>KillWait</b> configuration parameter.
        </dd>
      </blockquote>
      <br>
    </p>
    <blockquote type="cite"><dt><b>KillWait</b></dt>
      <dd>
        The interval, in seconds, given to a job's processes between the
        SIGTERM and SIGKILL signals upon reaching its time limit.
        If the job fails to terminate gracefully in the interval
        specified,
        it will be forcibly terminated.
        The default value is 30 seconds.
        The value may not exceed 65533.
      </dd>
    </blockquote>
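    <p>To make the KillWait description concrete, here is a minimal
      Python sketch of the SIGTERM, wait, SIGKILL sequence it
      describes. It is an illustration only, not Slurm's code: the
      names <b>terminate_with_kill_wait</b> and <b>run_demo</b> are
      made up for this example, the child process simply ignores
      SIGTERM to stand in for a job that won't exit gracefully, and it
      assumes a POSIX system.</p>

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a job that ignores SIGTERM and keeps running.
CHILD = (
    "import signal, time;"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
    "print('ready', flush=True);"
    "time.sleep(60)"
)


def terminate_with_kill_wait(proc, kill_wait):
    """Send SIGTERM, wait up to kill_wait seconds, then fall back to SIGKILL.

    Returns True if SIGKILL was needed, False if the process exited
    on its own within the interval.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=kill_wait)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return True


def run_demo(kill_wait=1.0):
    """Start the stubborn child and terminate it; returns (forced, returncode)."""
    with subprocess.Popen(
        [sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True
    ) as proc:
        proc.stdout.readline()  # wait until the child has installed its handler
        forced = terminate_with_kill_wait(proc, kill_wait)
        return forced, proc.returncode


if __name__ == "__main__":
    forced, code = run_demo(kill_wait=1.0)
    print("SIGKILL needed:", forced, "return code:", code)
```

    <p>In this sketch a larger kill_wait only gives the process more
      time to exit on its own after SIGTERM. It plays the role that
      KillWait plays in the description above.</p>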
    <p><br>
    </p>
    <p>--<br>
      Prentice<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 3/19/19 7:21 AM, Taras Shapovalov
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAJr6v3GYRpvWmTGVSR1vKX9zvXTVtvkebDK+CD3MsVD1yH+KKw@mail.gmail.com">
      <div dir="ltr">Hey guys,
        <div><br>
        </div>
        <div>When a job exceeds its time limit, Slurm tries to kill
          the job and fails:<br>
        </div>
        <div><br>
        </div>
        <div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">[2019-03-15T09:44:03.589]
            sched: _slurm_rpc_allocate_resources JobId=1325
            NodeList=rn003 usec=355<br>
            [2019-03-15T09:44:03.928]
            prolog_running_decr: Configuration for JobID=1325 is
            complete<br>
            [2019-03-15T09:45:12.739]
            Time limit exhausted for JobId=1325<br>
            [2019-03-15T09:45:44.001]
            _slurm_rpc_complete_job_allocation: JobID=1325 State=0x8006
            NodeCnt=1 error Job/step already completing or completed<br>
            [2019-03-15T09:46:12.805]
            Resending TERMINATE_JOB request JobId=1325 Nodelist=rn003<br>
            [2019-03-15T09:48:43.000]
            update_node: node rn003 reason set to: Kill task failed<br>
            [2019-03-15T09:48:43.000]
            update_node: node rn003 state set to DRAINING<br>
            [2019-03-15T09:48:43.000]
            got (nil)<br>
            [2019-03-15T09:48:43.816]
            cleanup_completing: job 1325 completion process took 211
            seconds</span><br>
        </div>
        <div><br>
        </div>
        <div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">This happens even on
            very simple "srun bash" jobs that exceed their time limits.
            Do you have any idea what this means? Upgrading to the
            latest version did not help.</span></div>
        <div><br>
        </div>
        <div><br>
        </div>
        <div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">Best regards,</span></div>
        <div><br>
        </div>
        <div><span
style="color:rgb(23,43,77);font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Roboto,Oxygen,Ubuntu,'Fira Sans','Droid Sans','Helvetica Neue',sans-serif;font-size:14px">Taras</span></div>
      </div>
    </blockquote>
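    <p>As a quick check on the quoted log, the short Python sketch
      below computes the time between the "Time limit exhausted" entry
      and the "cleanup_completing" entry. The timestamps are copied
      from the log above; the helper name <b>seconds_between</b> is
      just an assumption for this example.</p>

```python
from datetime import datetime

# Timestamp layout used in the quoted slurmctld log lines.
FMT = "%Y-%m-%dT%H:%M:%S.%f"


def seconds_between(start, end):
    """Return the number of seconds from start to end (both log timestamps)."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Values copied from the quoted log.
limit_exhausted = "2019-03-15T09:45:12.739"  # Time limit exhausted for JobId=1325
node_draining = "2019-03-15T09:48:43.000"    # node rn003 state set to DRAINING
cleanup_done = "2019-03-15T09:48:43.816"     # cleanup_completing

elapsed_to_drain = seconds_between(limit_exhausted, node_draining)
elapsed_to_cleanup = seconds_between(limit_exhausted, cleanup_done)

if __name__ == "__main__":
    print(f"time limit to DRAINING: {elapsed_to_drain:.3f} s")
    print(f"time limit to cleanup:  {elapsed_to_cleanup:.3f} s")
```

    <p>The second value agrees with the "211 seconds" reported by
      cleanup_completing, which is well beyond the 30-second KillWait
      default quoted above.</p>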
  </body>
</html>