[slurm-users] Job canceled after reaching QOS limits for CPU time.
Zacarias Benta
zacarias at lip.pt
Fri Oct 30 15:02:32 UTC 2020
And also the DMTCP project.
On 30/10/2020 14:10, Thomas M. Payerle wrote:
>
>
> On Fri, Oct 30, 2020 at 5:37 AM Loris Bennett
> <loris.bennett at fu-berlin.de> wrote:
>
> Hi Zacarias,
>
> Zacarias Benta <zacarias at lip.pt> writes:
>
> > Good morning everyone.
> >
> > I'm having an "issue", and I don't know if it is a bug or a feature.
> > I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
> > flags=NoDecay". I know the limit is too low, but I just wanted to
> > give you guys an example. Whenever a user submits a job and uses this
> > QOS, if the job reaches the limit I've defined, the job is canceled
> > and I lose all the computation it had done so far. Is it possible to
> > create a QOS/Slurm setting so that when users reach the limit, the
> > job state changes to pending? That way I could increase the limits
> > and change the job state back to Running so it can continue until it
> > reaches completion. I know this is a little bit odd, but I have users
> > that have requested CPU time as per an agreement between our HPC
> > center and their institutions. I know limits are set so they can be
> > enforced; what I'm trying to prevent is, for example, a person having
> > a job running for 2 months and at the end not having any data because
> > they just needed a few more days. This could be prevented if I could
> > grant them a couple more days of CPU time, if the job went into a
> > pending state after reaching the limit.
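> >
> > As a rough sketch of the kind of commands that raising the limit and
> > requeueing the job would involve (the job id and new value below are
> > just placeholders; a requeued job starts over unless it checkpoints):
> >
> >     # raise the QOS limit, then put the batch job back in the queue
> >     sacctmgr modify qos myqos set GrpTRESMins=cpu=20
> >     scontrol requeue 12345   # works only if the job is still known to slurmctld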
>
> Your "pending" suggestion does not really make sense. A pending job
> is no longer attached
> to a node, it is in the queue. It sounds like you are trying to
> "suspend" the job, e.g. ctrl-Z it in most shells, so that it is no
> longer using CPU. But even that would have it consuming RAM, which on
> many clusters would be a serious problem.
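>
> If suspending is what you are after, the admin-side commands do exist
> (untested sketch; the job id is a placeholder), but the memory stays
> allocated while the job sits suspended:
>
>     scontrol suspend 12345   # stops the job's processes; RAM remains in use
>     scontrol resume 12345    # lets the job carry on from where it stopped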
>
> Slurm supports a "grace period" for walltime, the OverTimeLimit
> parameter. I have not used it, but it might be what you want. From the
> web docs:
> *OverTimeLimit* - Amount by which a job can exceed its time limit
> before it is killed. A system-wide configuration parameter.
> I believe if a job has a 1-day time limit and OverTimeLimit is 1
> hour, the job effectively gets 25 hours before it is terminated.
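>
> If you try it, it is a single slurm.conf setting, something along these
> lines (the value is just an example, in minutes):
>
>     OverTimeLimit=60   # grace period jobs may run past their TimeLimit
>
> followed by an "scontrol reconfigure" (or a slurmctld restart) to pick it up.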
>
> You should also look into getting your users to checkpoint jobs (as
> hard as educating users is). That is, jobs, especially large or
> long-running jobs, should periodically save their state to a file. That
> way, if a job is terminated before it is complete for any reason (from
> time limits to failed hardware to power outages, etc.), it should be
> able to resume from the last checkpoint. So if a job checkpoints every
> 6 hours, it should not lose more than about 6 hours of runtime should
> it terminate prematurely. This is sort of the "pending" solution you
> referred to; the job dies, but can be restarted/requeued with
> additional time and more or less start up from where it left off.
> Some applications support checkpointing natively, and there are
> libraries/packages like DMTCP which can do more system-level checkpointing.
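>
> As a rough sketch of what a DMTCP-based batch script might look like
> (the module name, checkpoint interval and application are placeholders;
> untested):
>
>     #!/bin/bash
>     #SBATCH --time=14-00:00:00
>     module load dmtcp                          # site-specific
>     if [ -x ./dmtcp_restart_script.sh ]; then
>         ./dmtcp_restart_script.sh              # resume from the last checkpoint
>     else
>         dmtcp_launch -i 21600 ./my_simulation  # first run; checkpoint every 6 h (interval in seconds)
>     fi
>
> That way a requeued job picks up from its last checkpoint instead of
> starting over.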
>
>
> I'm not sure there is a solution to your problem. You want to both
> limit the time jobs can run and also not limit it. How much more time
> do you want to give a job which has reached its limit? A fixed time?
> A percentage of the time used up to now? What happens if two months
> plus a few more days is not enough and the job needs a few more days?
>
> The longer you allow jobs to run, the more CPU time is lost when jobs
> fail to complete, and the sadder users then are. In addition, the
> longer jobs run, the more likely they are to fall victim to hardware
> failure and the less able you are to perform administrative tasks
> which require a down-time. We run a university cluster with an upper
> time-limit of 14 days, which I consider fairly long, and occasionally
> extend individual jobs on a case-by-case basis. For our users this
> seems to work fine.
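>
> Extending an individual job is a one-liner along these lines (the job
> id and new limit are placeholders):
>
>     scontrol update JobId=12345 TimeLimit=16-00:00:00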
>
> If your jobs need months, you are in general using the wrong software
> or using the software wrong. There may be exceptions to this, but in
> my experience these are few and far between.
>
> So my advice would be to try to convince your users that shorter
> run-times are in fact better for them and only by happy accident also
> better for you.
>
> Just my 2¢.
>
> Cheers,
>
> Loris
>
> >
> > Cumprimentos / Best Regards,
> >
> > Zacarias Benta
> > INCD @ LIP - Universidade do Minho
> >
> >
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin
> Email: loris.bennett at fu-berlin.de
>
>
>
> --
> Tom Payerle
> DIT-ACIGS/Mid-Atlantic Crossroads    payerle at umd.edu
> 5825 University Research Park (301) 405-6135
> University of Maryland
> College Park, MD 20740-3831
--
Cumprimentos / Best Regards,
Zacarias Benta
INCD @ LIP - Universidade do Minho