[slurm-users] Simple free for all cluster

Sebastian T Smith stsmith at unr.edu
Tue Oct 6 16:33:39 UTC 2020


Our MaxTime and DefaultTime are both 14 days. Setting a high DefaultTime was a convenience to our users (and the support team), but it has turned out to be a mistake because it impacts backfill. Under high load we see small backfill jobs take over because the estimated start and end times of "DefaultTime" jobs are wildly incorrect: the backfill algorithm is less likely to calculate that a small job would delay the larger, highest-priority jobs, so it backfills the smaller jobs. I've tuned many of the backfill SchedulerParameters, but there's no replacement for an accurate time estimate.
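
To illustrate the effect described above, here is a toy sketch in Python. It is not Slurm's backfill algorithm. It assumes a single running job holds the resources the top-priority job is waiting for, and that a candidate job may start now only if its own time limit expires before the running job's limit does. The function name can_backfill and all of the durations are made up for the example.

```python
from datetime import timedelta

def can_backfill(candidate_limit, running_limit_remaining):
    """Toy check, not Slurm's real logic.

    The top-priority job is assumed to start when the running job's time
    limit expires. A candidate fits in the gap only if its own limit ends
    no later than that expected start.
    """
    return candidate_limit <= running_limit_remaining

candidate = timedelta(hours=6)   # hypothetical small job asking for 6 hours
inflated = timedelta(days=14)    # running job left on a 14-day default limit
accurate = timedelta(hours=2)    # same running job with a realistic limit

# With the inflated limit, the 6-hour job looks harmless and is backfilled.
fits_with_inflated = can_backfill(candidate, inflated)
# With the accurate limit, the scheduler sees it would delay the big job.
fits_with_accurate = can_backfill(candidate, accurate)

print(fits_with_inflated, fits_with_accurate)
```

If the running job really finishes after 2 hours, the backfilled 6-hour job is still holding resources when the high-priority job could have started, which is the delay the inflated estimate hides.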

Default values also become difficult to change once hundreds of submit scripts silently rely on them. Jason, I think setting a small DefaultTime limit is a good approach. We've considered resetting our default to 1 minute to force jobs to specify a time, but we will (likely) target an average-ish value now that we have stats from a couple of million jobs.
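
One way to pick an "average-ish" value from job statistics is sketched below in Python. It assumes the elapsed times have already been exported as a list of minutes (for example from accounting data), and it uses a nearest-rank quantile, which is only one possible choice. The function name suggest_default_minutes and the sample numbers are hypothetical, not our real data.

```python
import math

def suggest_default_minutes(elapsed_minutes, quantile=0.5):
    """Return the nearest-rank quantile of observed job run times.

    elapsed_minutes: run times of finished jobs, in minutes (assumed input).
    quantile: fraction of jobs the suggested default should cover.
    """
    if not elapsed_minutes:
        raise ValueError("need at least one job run time")
    if not 0 < quantile <= 1:
        raise ValueError("quantile must be in (0, 1]")
    ordered = sorted(elapsed_minutes)
    # Nearest-rank method: smallest value with at least `quantile` of the
    # jobs at or below it.
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]

# Hypothetical sample of six job run times, in minutes.
sample = [3, 12, 45, 60, 240, 1440]
suggested = suggest_default_minutes(sample)
print(suggested)
```

A median-style default like this covers about half of past jobs; the remaining jobs would need to request a time limit explicitly, which is the habit a small default is meant to encourage.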

- Sebastian

--

University of Nevada, Reno <http://www.unr.edu/>
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsmith at unr.edu
website: http://rc.unr.edu

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Jason Simms <simmsj at lafayette.edu>
Sent: Tuesday, October 6, 2020 7:53 AM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Simple free for all cluster

FWIW, I define the DefaultTime as 5 minutes, which effectively means that users must actually define a time for any "real" job. It helps users get into that habit: with a generous default to fall back on, most will not even bother to think critically and carefully about what time limit is actually reasonable, which is important for, e.g., effective job backfill and scheduling estimations.

I currently don't have a MaxTime defined, because how do I know how long a job will take? Most jobs on my cluster require no more than 3-4 days, but in some cases at other campuses, I know that jobs can run for weeks. I suppose even setting a time limit such as 4 weeks would be overkill, but at least it's not infinite. I'm curious what others use as that value, and how you arrived at it.

Warmest regards,
Jason

On Tue, Oct 6, 2020 at 5:55 AM John H <jsh at sdf.org> wrote:
Yes, I hadn't considered that! Thanks for the tip, Michael; I shall do that.

John

On Fri, Oct 02, 2020 at 01:49:44PM +0000, Renfro, Michael wrote:
> Depending on the users who will be on this cluster, I'd probably adjust the partition to have a defined, non-infinite MaxTime, and maybe a lower DefaultTime. Otherwise, it would be very easy for someone to start a job that reserves all cores until the nodes get rebooted, since all they have to do is submit a job with no explicit time limit (which would then use DefaultTime, which itself has a default value of MaxTime).
>



--
Jason L. Simms, Ph.D., M.P.H.
Manager of Research and High-Performance Computing
XSEDE Campus Champion
Lafayette College
Information Technology Services
710 Sullivan Rd | Easton, PA 18042
Office: 112 Skillman Library
p: (610) 330-5632