<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>And also the DMTCP project.<br>

    </p>

    <div class="moz-cite-prefix">On 30/10/2020 14:10, Thomas M. Payerle

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAHJ2ZQ-A6QXoL9d8s5EJRK6aRm=2tH5jJ_1J5EaCvjYRcxK9PA@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Fri, Oct 30, 2020 at 5:37

            AM Loris Bennett &lt;<a
              href="mailto:loris.bennett@fu-berlin.de"
              moz-do-not-send="true">loris.bennett@fu-berlin.de</a>&gt;

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">Hi Zacarias,<br>

            <br>

            Zacarias Benta &lt;<a href="mailto:zacarias@lip.pt"
              target="_blank" moz-do-not-send="true">zacarias@lip.pt</a>&gt;

            writes:<br>

            <br>

            > Good morning everyone.<br>

            ><br>

            > I'm having an "issue"; I don't know if it is a "bug or a

            feature".<br>

            > I've created a QOS: "sacctmgr add qos myqos set

            GrpTRESMins=cpu=10<br>

            > flags=NoDecay".  I know the limit is too low, but I

            just wanted to<br>

            > give you guys an example.  Whenever a user submits a

            job and uses this<br>

            > QOS, if the job reaches the limit I've defined, the job

            is canceled<br>

            > and I lose all the computation it had done so far.  Is

            it possible to<br>

            > create a QOS/slurm setting that when the users reach

            the limit, it<br>

            > changes the job state to pending?  This way I can

            increase the limits,<br>

            > change the job state to Running so it can continue until

            it reaches<br>

            > completion.  I know this is a little bit odd, but I

            have users that<br>

            > have requested cpu time as per an agreement between our

            HPC center and<br>

            > their institutions. I know limits are set so they can

            be enforced,<br>

            > what I'm trying to prevent is for example, a person

            having a job<br>

            > running for 2 months and at the end not having any data

            because they<br>

            > just needed a few more days. This could be prevented if

            I could grant<br>

            > them a couple more days of cpu, if the job went on to a

            pending state<br>

            > after reaching the limit.<br>

          </blockquote>

          <div>Your "pending" suggestion does not really make sense.  A
            pending job is no longer attached to a node; it is in the
            queue.  It sounds like you are trying to "suspend" the job,
            e.g. ctrl-Z it in most shells, so that it is no longer using
            CPU.  But even then it would still be consuming RAM, which on
            many clusters would be a serious problem.</div>

          <div><br>

          </div>

          <div>Slurm supports a "grace period" for walltime: the
            OverTimeLimit parameter.  I have not used it, but it might be
            what you want.  From the web docs:<br>

          </div>

          <div><b>OverTimeLimit</b> - Amount by which a job can exceed

            its time limit

            before it is killed. A system-wide configuration parameter.</div>

          <div>I believe that if a job has a 1-day time limit and
            OverTimeLimit is 1 hour, the job effectively gets 25 hours
            before it is terminated.</div>
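          <div>As a rough illustration (not taken from any real
            cluster), the setting lives in slurm.conf and is given in
            minutes; the value 60 below is an assumption chosen to match
            the 1 hour example above:</div>
          <pre>
```conf
# slurm.conf (fragment)
# Allow jobs to run up to 60 minutes past their time limit before they are killed.
OverTimeLimit=60
```
          </pre>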

          <div><br>

          </div>

          <div>You should also look into getting your users to
            checkpoint their jobs (as hard as educating users is).  I.e.,
            jobs, especially large or long-running jobs, should
            periodically save their state to a file.  That way, if a job
            is terminated before it is complete for any reason (from
            time limits to failed hardware to power outages, etc.), it
            should be able to resume from the last checkpoint.  So if a
            job checkpoints every 6 hours, it should not lose more than
            about 6 hours of runtime should it terminate prematurely.
            This is sort of the "pending" solution you referred to: the
            job dies, but can be restarted/requeued with additional time
            and more or less start up from where it left off.</div>

          <div>Some applications support checkpointing natively, and
            there are libraries/packages like DMTCP, which can do more
            system-level checkpointing.<br>

          </div>
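          <div>A minimal sketch of what application-level checkpointing
            can look like, assuming the job's state fits in a small JSON
            file.  The function names, the state layout and the
            stop_after argument (which only simulates the job being
            killed) are hypothetical and not part of Slurm or DMTCP:</div>
          <pre>
```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file, then rename it into place, so that a
    kill in the middle of a write cannot leave a truncated checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, every, stop_after=None):
    """Sum the integers 0..n_steps-1 (a stand-in for real work),
    checkpointing every `every` steps.  `stop_after` simulates the job
    being killed after that many steps in this invocation."""
    state = load_checkpoint(path)
    done_now = 0
    while state["step"] != n_steps:
        if stop_after is not None and done_now == stop_after:
            return None  # simulated termination, nothing more is saved
        state["total"] += state["step"]
        state["step"] += 1
        done_now += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```
          </pre>
          <div>If the batch script simply calls run() with the same
            checkpoint path each time, resubmitting the job after it has
            been killed continues from the last saved step rather than
            from zero.</div>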

          <div><br>

          </div>

          <div> </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <br>

            I'm not sure there is a solution to your problem.  You want

            to both<br>

            limit the time jobs can run and also not limit it.  How much

            more time<br>

            do you want to give a job which has reached its limit?  A

            fixed time?  A<br>

            percentage of the time used up to now?  What happens if two

            months plus<br>

            a few more days is not enough and the job needs a few more

            days?<br>

            <br>

            The longer you allow jobs to run, the more CPU is lost when

            jobs fail to<br>

            complete, the sadder users then are.  In addition the longer

            jobs run,<br>

            the more likely they are to fall victim to hardware failure

            and the less<br>

            able you are to perform administrative tasks which require a

            down-time.<br>

            We run a university cluster with an upper time-limit of 14

            days, which I<br>

            consider fairly long, and occasionally extend individual

            jobs on a<br>

            case-by-case basis.  For our users this seems to work fine.<br>

            <br>

            If your jobs need months, you are in general using the wrong

            software<br>

            or using the software wrong.  There may be exceptions to

            this, but in my<br>

            experience, these are few and far between.<br>

            <br>

            So my advice would be to try to convince your users that

            shorter<br>

            run-times are in fact better for them and only by happy

            accident also<br>

            better for you.<br>

            <br>

            Just my 2¢.<br>

            <br>

            Cheers,<br>

            <br>

            Loris<br>

            <br>

            ><br>

            > Cumprimentos / Best Regards,<br>

            ><br>

            > Zacarias Benta<br>

            > INCD @ LIP - Universidade do Minho<br>

            ><br>

            > INCD Logo<br>

            ><br>

            -- <br>

            Dr. Loris Bennett (Mr.)<br>

            ZEDAT, Freie Universität Berlin         Email <a

              href="mailto:loris.bennett@fu-berlin.de" target="_blank"

              moz-do-not-send="true">loris.bennett@fu-berlin.de</a><br>

            <br>

          </blockquote>

        </div>

        <br clear="all">

        <br>

        -- <br>

        <div dir="ltr" class="gmail_signature">

          <div dir="ltr">

            <div>

              <div dir="ltr">

                <div>

                  <div dir="ltr">Tom Payerle <br>

                    DIT-ACIGS/Mid-Atlantic Crossroads        <a

                      href="mailto:payerle@umd.edu" target="_blank"

                      moz-do-not-send="true">payerle@umd.edu</a><br>

                  </div>

                  <div>5825 University Research Park               (301)

                    405-6135<br>

                  </div>

                  <div dir="ltr">University of Maryland<br>

                    College Park, MD 20740-3831<br>

                  </div>

                </div>

              </div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <div class="moz-signature">-- <br>

      <p>

        <b>Cumprimentos / Best Regards,</b></p>

      Zacarias Benta<br>

      INCD @ LIP - Universidade do Minho<br>

      <br>

      <p>

        <img src="https://www.incd.pt/img/incd-dark-logo.png" alt="INCD

          Logo" width="181" height="93"> </p>

    </div>

  </body>

</html>