[slurm-users] Job canceled after reaching QOS limits for CPU time.

Zacarias Benta zacarias at lip.pt
Fri Oct 30 13:38:11 UTC 2020


Hi Loris,

Thanks for taking the time to reply to my message.
We do indeed want to limit and not limit at the same time; I know 
that sounds tricky, but let me try to explain.
Our HPC center currently limits single-core jobs to 5 days of 
runtime, multi-core jobs to 4 days, and very controlled, special 
requests to 18 days.
We will receive a new batch of users who have been awarded a grant 
that allows them to submit jobs to our cluster. They had to submit a 
proposal proving the scientific validity of their project and 
describing the resources they would need (CPU time, number of cores, 
specific software, ...).  Since we know from experience that most 
new users have no idea how long their jobs will take to run, we're 
worried that some of them have underestimated their resource needs 
and will end up submitting a job that runs until the limit is 
reached and is then cancelled, making their effort and their 
consumption of our CPU time worthless.
What would be great is if a job that reached the limit stayed in a 
pending state; we would then discuss the options with the users and 
decide either to increase the CPU time (within reasonable values) or 
to kill the job.
I know it sounds kind of silly to set a limit and at the same time 
allow for exceptions, but we are trying to prevent the waste of 
valuable CPU time.
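For what it's worth, the manual workaround we have in mind would look 
roughly like the sketch below (the QOS name "myqos", the job ID, and 
the new limit value are just placeholder examples, not our actual 
configuration):

```shell
# Raise the aggregate CPU-minutes limit on the QOS after discussing
# new values with the user (100000 is an example value):
sacctmgr modify qos where name=myqos set GrpTRESMins=cpu=100000

# If the job could be held in a pending state instead of being
# cancelled, it could then be released to continue running:
scontrol release 123456
```

The missing piece, of course, is getting Slurm to hold the job at the 
limit rather than cancel it, which is exactly what we are asking about.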


On 30/10/2020 09:34, Loris Bennett wrote:

> Hi Zacarias,
>
> Zacarias Benta <zacarias at lip.pt> writes:
>
>> Good morning everyone.
>>
>> I'm having an "issue"; I don't know if it is a bug or a feature.
>> I've created a QOS: "sacctmgr add qos myqos set GrpTRESMins=cpu=10
>> flags=NoDecay".  I know the limit is too low, but I just wanted to
>> give you guys an example.  Whenever a user submits a job and uses this
>> QOS, if the job reaches the limit I've defined, the job is canceled
>> and I lose the computation it had done so far.  Is it possible to
>> create a QOS/slurm setting that when the users reach the limit, it
>> changes the job state to pending?  This way I can increase the limits,
>> change the job state to Running so it can continue until it reaches
>> completion.  I know this is a little bit odd, but I have users that
>> have requested cpu time as per an agreement between our HPC center and
>> their institutions. I know limits are set so they can be enforced,
>> what I'm trying to prevent is for example, a person having a job
>> running for 2 months and at the end not having any data because they
>> just needed a few more days. This could be prevented if I could grant
>> them a couple more days of cpu, if the job went on to a pending state
>> after reaching the limit.
> I'm not sure there is a solution to your problem.  You want to both
> limit the time jobs can run and also not limit it.  How much more time
> do you want to give a job which has reached its limit?  A fixed time?  A
> percentage of the time used up to now?  What happens if two months plus
> a few more days is not enough and the job needs a few more days?
>
> The longer you allow jobs to run, the more CPU is lost when jobs fail to
> complete, the sadder users then are.  In addition the longer jobs run,
> the more likely they are to fall victim to hardware failure and the less
> able you are to perform administrative tasks which require a down-time.
> We run a university cluster with an upper time-limit of 14 days, which I
> consider fairly long, and occasionally extend individual jobs on a
> case-by-case basis.  For our users this seems to work fine.
>
> If your jobs need months, you are in general using the wrong software
> or using the software wrong.  There may be exceptions to this, but in my
> experience, these are few and far between.
>
> So my advice would be to try to convince your users that shorter
> run-times are in fact better for them and only by happy accident also
> better for you.
>
> Just my 2¢.
>
> Cheers,
>
> Loris
>
>> Cumprimentos / Best Regards,
>>
>> Zacarias Benta
>> INCD @ LIP - Universidade do Minho
>>
>>
-- 

*Cumprimentos / Best Regards,*

Zacarias Benta
INCD @ LIP - Universidade do Minho

INCD Logo
