[slurm-users] Job canceled after reaching QOS limits for CPU time.

Loris Bennett loris.bennett at fu-berlin.de
Fri Oct 30 09:34:47 UTC 2020


Hi Zacarias,

Zacarias Benta <zacarias at lip.pt> writes:

> Good morning everyone.
>
> I'm having an "issue", I don't know if it is a "bug or a feature".
> I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
> flags=NoDecay".  I know the limit is too low, but I just wanted to
> give you guys an example.  Whenever a user submits a job and uses this
> QOS, if the job reaches the limit I've defined, the job is canceled
> and I lose all the computation it had done so far.  Is it possible to
> create a QOS/slurm setting that when the users reach the limit, it
> changes the job state to pending?  This way I can increase the limits,
> change the job state to Running so it can continue until it reaches
> completion.  I know this is a little bit odd, but I have users that
> have requested cpu time as per an agreement between our HPC center and
> their institutions. I know limits are set so they can be enforced;
> what I'm trying to prevent is for example, a person having a job
> running for 2 months and at the end not having any data because they
> just needed a few more days. This could be prevented if I could grant
> them a couple more days of cpu, if the job went on to a pending state
> after reaching the limit.

I'm not sure there is a solution to your problem.  You want to both
limit the time jobs can run and also not limit it.  How much more time
do you want to give a job which has reached its limit?  A fixed time?  A
percentage of the time used up to now?  What happens if two months plus
a few more days is not enough and the job needs a few more days?
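
To put a number on "a few more days": GrpTRESMins=cpu=N is a budget of
CPU-minutes, so the cost of an extension is simply cores times minutes.
A rough sketch, assuming a made-up 64-core job and a 3-day extension
(both figures are my assumptions, not taken from your setup).  It only
prints the sacctmgr command rather than running it:

```shell
#!/bin/sh
# CPU-minutes consumed by a job using $1 cores for $2 days.
cpu_minutes() {
    echo $(( $1 * $2 * 24 * 60 ))
}

# Print (do not run) the command that would raise the QOS budget.
# $1 = QOS name, $2 = current limit in CPU-minutes, $3 = extra CPU-minutes.
raise_qos_cmd() {
    echo "sacctmgr modify qos $1 set GrpTRESMins=cpu=$(( $2 + $3 ))"
}

extra=$(cpu_minutes 64 3)        # hypothetical: 64 cores, 3 more days
raise_qos_cmd myqos 10 "$extra"  # 10 is the limit from your example
```

The limit would have to be raised before the job hits it, since, as you
describe, the job is cancelled once the budget is used up.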

The longer you allow jobs to run, the more CPU time is lost when jobs
fail to complete, and the sadder users then are.  In addition, the longer
jobs run, the more likely they are to fall victim to hardware failure and
the less able you are to perform administrative tasks which require downtime.
We run a university cluster with an upper time-limit of 14 days, which I
consider fairly long, and occasionally extend individual jobs on a
case-by-case basis.  For our users this seems to work fine.
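
An extension of this kind can be done with "scontrol update".  A minimal
sketch, assuming a hypothetical job ID 12345 and two extra days; the
helper below only prints the command, so it can be checked before an
admin runs it by hand:

```shell
#!/bin/sh
# Print (do not run) the command that adds time to a job's time limit.
# $1 = job ID, $2 = extra time in Slurm's days-hours:minutes:seconds format.
extend_job_cmd() {
    echo "scontrol update JobId=$1 TimeLimit+=$2"
}

extend_job_cmd 12345 2-00:00:00   # hypothetical job ID, two extra days
```

Note that this changes the job's wall-time limit only; it does not touch
a CPU-minutes budget set on a QOS.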

If your jobs need months, you are in general using the wrong software
or using the software wrong.  There may be exceptions to this, but in my
experience, these are few and far between.

So my advice would be to try to convince your users that shorter
run-times are in fact better for them and only by happy accident also
better for you.

Just my 2¢.

Cheers,

Loris

>
> Cumprimentos / Best Regards,
>
> Zacarias Benta
> INCD @ LIP - Universidade do Minho
>
-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin         Email loris.bennett at fu-berlin.de


