<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    <p>I would recommend putting a cleanup process in your epilog
      script.  We have a check here that sees whether the job completed
      and, if so, terminates all of the user's processes with kill -9 to
      clean up any residuals. If that fails, it closes off the node so
      we can reboot it.</p>
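    <p>As a rough illustration of that epilog step, here is a minimal
      Python sketch. It is not our actual script: the function names
      (pids_owned_by, snapshot, cleanup) and the five-second grace
      period are made up for the example, it skips the "did the job
      complete" check, and it assumes the node was allocated exclusively
      to one user. The only external commands used are ps and
      scontrol update.</p>
    <pre>
```python
import os
import signal
import subprocess
import time


def pids_owned_by(ps_output, uid):
    """Return the PIDs in `ps -eo pid=,uid=` style output that belong to `uid`."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and int(fields[1]) == uid:
            pids.append(int(fields[0]))
    return pids


def snapshot():
    """List every process on the node as 'PID UID' lines."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,uid="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def cleanup(uid, node, grace_seconds=5):
    """Kill everything owned by `uid`; drain `node` if anything survives.

    Assumes the node was allocated exclusively to this user, otherwise
    this would also kill the user's other jobs on the same node.
    """
    if uid == 0:
        raise ValueError("refusing to kill root-owned processes")
    for pid in pids_owned_by(snapshot(), uid):
        try:
            os.kill(pid, signal.SIGKILL)  # the kill -9 from the text
        except ProcessLookupError:
            pass  # already gone between the snapshot and the kill
    time.sleep(grace_seconds)
    survivors = pids_owned_by(snapshot(), uid)
    if survivors:
        # Take the node out of service so it can be rebooted.
        subprocess.run(
            ["scontrol", "update", f"NodeName={node}",
             "State=DRAIN", "Reason=epilog cleanup failed"],
            check=True,
        )
    return survivors
```
    </pre>
    <p>An epilog wrapper would call cleanup() with the job owner's
      numeric UID and the node name; where those values come from
      (for example the environment Slurm passes to the epilog) is left
      as an assumption to check against your own setup.</p>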
    <p>-Paul Edmon-<br>
    </p>
    <br>
    <div class="moz-cite-prefix">On 04/23/2018 08:10 AM, John Hearns
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAPqNE2XfYy8U1mTV0ZknKu1JybfouFkMB_MXZas1hfvjs1PdvA@mail.gmail.com">
      <div dir="ltr">
        <div>
          <div>
            <div>
              <div>
                <div>
                  <div>
                    <div>
                      <div>
                        <div>
                          <div>
                            <div>
                              <div>
                                <div>
                                  <div>
                                    <div>
                                      <div>
                                        <div>
                                          <div>
                                            <div>
                                              <div>
                                                <div>
                                                  <div>
                                                    <div>
                                                      <div>
                                                        <div>
                                                          <div>
                                                          <div>Nicolo, I
                                                          cannot say
                                                          what your
                                                          problem is.<br>
                                                          </div>
                                                          However in the
                                                          past with
                                                          problems like
                                                          this I would <br>
                                                          <br>
                                                          </div>
                                                          a) look at    
                                                          ps -eaf
                                                          --forest<br>
                                                        </div>
                                                        Try to see what
                                                        the parent
                                                        processes of
                                                        these job
                                                        processes are<br>
                                                      </div>
                                                      Clearly if the
                                                      parent PID is 1
                                                      then --forest is
                                                      not much help. 
                                                      But the --forest
                                                      option is my
                                                      'go-to' option<br>
                                                      <br>
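                                                      <p>For the PPID 1 case, where the tree view no longer helps, a small filter over the ps output can list a user's orphaned processes directly. This is only a sketch: the function name find_orphans and the sample UID are invented for illustration, and it expects the output of ps -eo pid=,ppid=,uid=,args= as input.</p>
                                                      <pre>
```python
def find_orphans(ps_output, uid):
    """Return (pid, command) pairs for processes of `uid` whose parent is PID 1.

    `ps_output` is expected to be the text printed by
    `ps -eo pid=,ppid=,uid=,args=` (one process per line).
    """
    orphans = []
    for line in ps_output.splitlines():
        # Split into at most four fields so the command keeps its spaces.
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue
        pid, ppid, owner, command = fields
        if int(ppid) == 1 and int(owner) == uid:
            orphans.append((int(pid), command.strip()))
    return orphans


if __name__ == "__main__":
    import subprocess

    listing = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,uid=,args="],
        capture_output=True, text=True, check=True,
    ).stdout
    # 1001 is a placeholder UID; substitute the job owner's UID.
    for pid, command in find_orphans(listing, 1001):
        print(pid, command)
```
                                                      </pre>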
                                                    </div>
                                                    b) look closely at
                                                    the slurm logs. Do
                                                    not fool yourself -
                                                    force yourself to
                                                    read the logs line
                                                    by line, around the
                                                    timestamp when the
                                                    job ends.<br>
                                                    <br>
                                                    <br>
                                                  </div>
                                                  Being a bit more
                                                  helpful, in my last
                                                  job we had endless
                                                  problems with Matlab
                                                  jobs leaving orphaned
                                                  processes.<br>
                                                </div>
                                                To be fair to Matlab,
                                                they have a utility
                                                which 'properly' starts
                                                parallel jobs under the
                                                control of the batch
                                                system (OK, it was
                                                PBSpro)<br>
                                              </div>
                                              But users can easily start
                                              a job and 'fire off'
                                              processes in Matlab which
                                              are not under the direct
                                              control of the batch
                                              daemon, leaving orphaned
                                              processes<br>
                                            </div>
                                            when the job ends.<br>
                                            <br>
                                          </div>
                                          Actually, if you think about
                                          it, this is how a batch
                                          system works. The batch system
                                          daemon starts running
                                          processes on your behalf.<br>
                                        </div>
                                        When the job is killed, all the
                                        daughter processes of that
                                        daemon should die.<br>
                                      </div>
                                      It is instructive to run ps -eaf
                                      --forest  sometimes on a compute
                                      node during a normal job run. Get
                                      to know how things are being
                                      created, and what their parents
                                      are<br>
                                    </div>
                                    (two dashes in front of the forest
                                    argument)<br>
                                    <br>
                                  </div>
                                  Now think of users who start a batch
                                  job and get a list of compute hosts.<br>
                                </div>
                                They MAY use a mechanism such as ssh or
                                indeed pbsdsh to start running job
                                processes on those nodes.<br>
                              </div>
                              You will then have trouble with orphaned
                              processes when the job ends.<br>
                            </div>
                            Techniques for dealing with this:<br>
                          </div>
                          a use the PAM module which stops ssh login 
                          (actually - this probably allows ssh login
                          during a job's time, when the user has a node
                          allocated)<br>
                        </div>
                        b my favourite - CPU sets - actually this won't
                        stop ssh logins either.<br>
                      </div>
                      c Shouting, much shouting. Screaming.<br>
                      <br>
                    </div>
                    Regarding users behaving like this, I have seen
                    several cases where it happened for
                    understandable reasons.<br>
                  </div>
                  On a system which I did not manage, but was asked for
                  advice on, the vendor had provided a sample script for
                  running Ansys.<br>
                </div>
                The user wanted to run Abaqus on the compute nodes (or
                some such - a different application anyway)<br>
              </div>
              So he started an empty Ansys job, which sat doing
              nothing. He then took the list of hosts provided by the batch
              system<br>
            </div>
            and fired up an interactive Abaqus session on his terminal.<br>
          </div>
          I honestly hesitate to label this behaviour 'wrong'<br>
          <br>
        </div>
        I have also seen similar when running a CFD job.<br>
      </div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On 23 April 2018 at 11:50, Nicolò
          Parmiggiani <span dir="ltr">&lt;<a
              href="mailto:nicolo.parmiggiani@gmail.com" target="_blank"
              moz-do-not-send="true">nicolo.parmiggiani@gmail.com</a>&gt;</span>
          wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0
            .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr">Hi,
              <div><br>
              </div>
              <div>I have a job that keeps running even though the
                internal process is finished.</div>
              <div><br>
              </div>
              <div>What could be the problem?</div>
              <div><br>
              </div>
              <div>Thank you.</div>
            </div>
          </blockquote>
        </div>
        <br>
      </div>
    </blockquote>
    <br>
  </body>
</html>