[slurm-users] Simple free for all cluster

Marcus Wagner wagner at itc.rwth-aachen.de
Thu Oct 8 05:13:14 UTC 2020


Hi Jason,

we intend to have a maximum wallclock time of 5 days. We chose this so that we can do timely maintenance without disturbing or killing users' jobs. Yet we see that some users and/or codes need a longer runtime, which is why we set the MaxTime for the partitions to 30 days. Our users must write a proposal if they need a larger amount of core hours, and there they have to justify why they need a runtime longer than 5 days. This maximum time is enforced by the association created for the triple (account, user, partition). When we need to do maintenance on short notice, we kill such long-running jobs; our users know that.
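Granting a longer per-user limit on top of the partition ceiling can be done through the association with sacctmgr; a minimal sketch, assuming hypothetical account, user, and partition names (the 30-day value matches the extended limit described above):

```
# The partition-wide MaxTime (slurm.conf) stays at 30 days, while the
# association limit caps each individual user. Here a hypothetical user
# "jdoe" in account "proj01" is granted 30 days on partition "c18m"
# after an approved proposal; other users keep the default 5-day limit.
sacctmgr modify user where name=jdoe account=proj01 partition=c18m \
         set MaxWall=30-00:00:00
```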

Our default time is set to 15 minutes.
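In slurm.conf terms, a setup like the one described above might look like the following sketch (partition name and node list are hypothetical; the time values are the ones quoted in this thread):

```
# Hypothetical partition definition: MaxTime allows up to 30 days for
# users whose association permits it; DefaultTime=15 minutes means jobs
# submitted without an explicit --time get only 15 minutes, nudging
# users to state a realistic limit (which helps backfill scheduling).
PartitionName=c18m Nodes=nc[001-100] State=UP \
    MaxTime=30-00:00:00 DefaultTime=00:15:00
```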

Best
Marcus

On 06.10.2020 at 16:53, Jason Simms wrote:
> FWIW, I define the DefaultTime as 5 minutes, which effectively means for any "real" job that users must actually define a time. It helps users get into that habit, because in the absence of a DefaultTime, most will not even bother to think critically and carefully about what time limit is actually reasonable, which is important for, e.g., effective job backfill and scheduling estimations.
> 
> I currently don't have a MaxTime defined, because how do I know how long a job will take? Most jobs on my cluster require no more than 3-4 days, but in some cases at other campuses, I know that jobs can run for weeks. I suppose even setting a time limit such as 4 weeks would be overkill, but at least it's not infinite. I'm curious what others use as that value, and how you arrived at it.
> 
> Warmest regards,
> Jason
> 
> On Tue, Oct 6, 2020 at 5:55 AM John H <jsh at sdf.org <mailto:jsh at sdf.org>> wrote:
> 
>     Yes, I hadn't considered that! Thanks for the tip, Michael; I shall do that.
> 
>     John
> 
>     On Fri, Oct 02, 2020 at 01:49:44PM +0000, Renfro, Michael wrote:
>      > Depending on the users who will be on this cluster, I'd probably adjust the partition to have a defined, non-infinite MaxTime, and maybe a lower DefaultTime. Otherwise, it would be very easy for someone to start a job that reserves all cores until the nodes get rebooted, since all they have to do is submit a job with no explicit time limit (which would then use DefaultTime, which itself has a default value of MaxTime).
>      >
> 
> 
> 
> -- 
> *Jason L. Simms, Ph.D., M.P.H.*
> Manager of Research and High-Performance Computing
> XSEDE Campus Champion
> Lafayette College
> Information Technology Services
> 710 Sullivan Rd | Easton, PA 18042
> Office: 112 Skillman Library
> p: (610) 330-5632

-- 
Dipl.-Inf. Marcus Wagner

IT Center
Gruppe: Systemgruppe Linux
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de

Social media channels of the IT Center:
https://blog.rwth-aachen.de/itc/
https://www.facebook.com/itcenterrwth
https://www.linkedin.com/company/itcenterrwth
https://twitter.com/ITCenterRWTH
https://www.youtube.com/channel/UCKKDJJukeRwO0LP-ac8x8rQ


