[slurm-users] [External] Re: Simple free for all cluster

Thu Oct 22 18:57:13 UTC 2020

I know I'm replying late to this party, but for what it's worth, when 
this topic was debated at my current employer, the more advanced users 
(the ones who know how to checkpoint their code, etc.) argued for 
shorter time limits. They wanted a maximum runtime of only 24 hours, 
whereas the less advanced users (who didn't know how to checkpoint, or 
were running non-parallelized code) wanted really long or unlimited 
time limits.

Short time limits make it easier both to schedule downtime and to give 
all users a fair crack at using the cluster (one person can't use 75% 
of the cluster for months at a time). In the long run, both of these 
can lead to happier customers (less downtime due to more regular 
maintenance, more equitable use of the cluster), but it takes a lot of 
user education, and selling users on the benefits, to get them to 
agree.
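
As a rough illustration of the downtime point (a sketch with made-up 
dates, not anything taken from a real scheduler config): once a maximum 
runtime is in place, the longest anyone has to wait for running jobs to 
drain is the time limit itself, so the limit sets how much notice a 
maintenance window needs. The function name and the maintenance date 
below are hypothetical.

```python
from datetime import datetime, timedelta


def latest_full_length_start(maintenance_start: datetime,
                             max_runtime: timedelta) -> datetime:
    """Latest moment a job that uses its full time limit can start
    and still finish before the maintenance window opens."""
    return maintenance_start - max_runtime


# Hypothetical maintenance window, for illustration only.
maintenance = datetime(2020, 11, 10, 8, 0)

limits = [
    ("48 hours", timedelta(hours=48)),
    ("30 days", timedelta(days=30)),
    ("90 days", timedelta(days=90)),
]

for label, limit in limits:
    cutoff = latest_full_length_start(maintenance, limit)
    print(f"{label:>8} limit: full-length jobs must start by {cutoff:%Y-%m-%d %H:%M}")
```

With a 48-hour limit the cluster can be drained on two days' notice; 
with a 90-day limit the same window has to be planned about three 
months ahead.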

When we first switched to a 48-hour time limit, there were definitely 
some unhappy campers, but now that it's been in place for a while, 
everyone seems to have adapted and accepted it.

Prentice

On 10/17/20 5:08 AM, John H wrote:
> Thanks Chris will likely need it :)
>
> John
>
> On Sat, Oct 10, 2020 at 04:19:06PM -0700, Chris Samuel wrote:
>> On Tuesday, 6 October 2020 7:53:02 AM PDT Jason Simms wrote:
>>
>>> I currently don't have a MaxTime defined, because how do I know how long a
>>> job will take? Most jobs on my cluster require no more than 3-4 days, but
>>> in some cases at other campuses, I know that jobs can run for weeks. I
>>> suppose even setting a time limit such as 4 weeks would be overkill, but at
>>> least it's not infinite. I'm curious what others use as that value, and how
>>> you arrived at it
>> My journey over the last 16 years in HPC has been one of decreasing time
>> limits. Back in 2003, with VPAC's first Linux cluster, we had no time limits;
>> we then introduced a 90 day limit so we could plan quarterly maintenances (and
>> yes, we had users who had jobs which legitimately ran longer than that, so
>> they had to learn to checkpoint).  At VLSCI we had 30 day limits (life
>> sciences, so many long running, poorly scaling jobs), then when I was at
>> Swinburne it was a 7 day limit, and now here at NERSC we've got 2 day limits.
>>
>> It really is down to what your use cases are and how much influence you have
>> over your users.  It's often the HPC sysadmin's responsibility to try and find
>> that balance between good utilisation, effective use of the system and reaching
>> the desired science/research/development outcomes.
>>
>> Best of luck!
>> Chris
>> -- 
>>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>>
-- 
Prentice Bisbal
Lead Software Engineer
Research Computing
Princeton Plasma Physics Laboratory
http://www.pppl.gov