[slurm-users] Job canceled after reaching QOS limits for CPU time.
Zacarias Benta
zacarias at lip.pt
Fri Oct 30 15:02:32 UTC 2020
And also the DMTCP project.
On 30/10/2020 14:10, Thomas M. Payerle wrote:
>
>
> On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett
> <loris.bennett at fu-berlin.de> wrote:
>
> Hi Zacarias,
>
> Zacarias Benta <zacarias at lip.pt> writes:
>
> > Good morning everyone.
> >
> > I'm having an "issue", and I don't know if it is a bug or a feature.
> > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
> > flags=NoDecay". I know the limit is too low, but I just wanted to
> > give you guys an example. Whenever a user submits a job and uses this
> > QOS, if the job reaches the limit I've defined, the job is canceled
> > and I lose all the computation it had done so far. Is it possible to
> > create a QOS/Slurm setting so that when users reach the limit, the
> > job state changes to pending? That way I could increase the limits
> > and change the job state back to Running so it can continue until it
> > reaches completion. I know this is a little bit odd, but I have users
> > that have requested CPU time as per an agreement between our HPC
> > center and their institutions. I know limits are set so they can be
> > enforced; what I'm trying to prevent is, for example, a person having
> > a job running for 2 months and at the end not having any data because
> > they just needed a few more days. This could be prevented if I could
> > grant them a couple more days of CPU time, if the job went into a
> > pending state after reaching the limit.
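> >
> > As a rough sketch of the kind of commands that raising the limit and
> > requeueing the job would involve (the job id and new value below are
> > just placeholders; a requeued job starts over unless it checkpoints):
> >
> >     # raise the QOS limit, then put the batch job back in the queue
> >     sacctmgr modify qos myqos set GrpTRESMins=cpu=20
> >     scontrol requeue 12345   # works only if the job is still known to slurmctld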
>
> Your "pending" suggestion does not really make sense. A pending job
> is no longer attached
> to a node, it is in the queue. It sounds like you are trying to
> "suspend" the job, e.g. ctrl-Z it in most shells, so that it is no
> longer using CPU. But even that would have it consuming RAM, which on
> many clusters would be a serious problem.
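>
> If suspending is what you are after, the admin-side commands do exist
> (untested sketch; the job id is a placeholder), but the memory stays
> allocated while the job sits suspended:
>
>     scontrol suspend 12345   # stops the job's processes; RAM remains in use
>     scontrol resume 12345    # lets the job carry on from where it stopped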
>
> Slurm supports a "grace period" for walltime, the OverTimeLimit
> parameter. I have not used it, but it might be what you want. From the
> web docs:
> *OverTimeLimit* - Amount by which a job can exceed its time limit
> before it is killed. A system-wide configuration parameter.
> I believe if a job has a 1-day time limit and OverTimeLimit is 1
> hour, the job effectively gets 25 hours before it is terminated.
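>
> If you try it, it is a single slurm.conf setting, something along these
> lines (the value is just an example, in minutes):
>
>     OverTimeLimit=60   # grace period jobs may run past their TimeLimit
>
> followed by an "scontrol reconfigure" (or a slurmctld restart) to pick it up.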
>
> You should also look into getting your users to checkpoint jobs (as
> hard as educating users is). That is, jobs, especially large or
> long-running jobs, should periodically save their state to a file. That
> way, if a job is terminated before it is complete for any reason (from
> time limits to failed hardware to power outages, etc.), it should be
> able to resume from the last checkpoint. So if a job checkpoints every
> 6 hours, it should not lose more than about 6 hours of runtime should
> it terminate prematurely. This is sort of the "pending" solution you
> referred to; the job dies, but can be restarted/requeued with
> additional time and more or less start up from where it left off.
> Some applications support checkpointing natively, and there are
> libraries/packages like DMTCP which can do more system-level checkpointing.
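>
> As a rough sketch of what a DMTCP-based batch script might look like
> (the module name, checkpoint interval and application are placeholders;
> untested):
>
>     #!/bin/bash
>     #SBATCH --time=14-00:00:00
>     module load dmtcp                          # site-specific
>     if [ -x ./dmtcp_restart_script.sh ]; then
>         ./dmtcp_restart_script.sh              # resume from the last checkpoint
>     else
>         dmtcp_launch -i 21600 ./my_simulation  # first run; checkpoint every 6 h (interval in seconds)
>     fi
>
> That way a requeued job picks up from its last checkpoint instead of
> starting over.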
>
>
> I'm not sure there is a solution to your problem. You want to both
> limit the time jobs can run and also not limit it. How much more time
> do you want to give a job which has reached its limit? A fixed time?
> A percentage of the time used up to now? What happens if two months
> plus a few more days is not enough and the job needs a few more days?
>
> The longer you allow jobs to run, the more CPU time is lost when jobs
> fail to complete, and the sadder users then are. In addition, the
> longer jobs run, the more likely they are to fall victim to hardware
> failure and the less able you are to perform administrative tasks
> which require a down-time. We run a university cluster with an upper
> time-limit of 14 days, which I consider fairly long, and occasionally
> extend individual jobs on a case-by-case basis. For our users this
> seems to work fine.
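>
> Extending an individual job is a one-liner along these lines (the job
> id and new limit are placeholders):
>
>     scontrol update JobId=12345 TimeLimit=16-00:00:00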
>
> If your jobs need months, you are in general using the wrong software
> or using the software wrong. There may be exceptions to this, but in
> my experience these are few and far between.
>
> So my advice would be to try to convince your users that shorter
> run-times are in fact better for them and only by happy accident also
> better for you.
>
> Just my 2¢.
>
> Cheers,
>
> Loris
>
> >
> > Cumprimentos / Best Regards,
> >
> > Zacarias Benta
> > INCD @ LIP - Universidade do Minho
> >
> >
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin
> Email: loris.bennett at fu-berlin.de
>
>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads    payerle at umd.edu
> 5825 University Research Park (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
--
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - Universidade do Minho