[slurm-users] Quickly throttling/limiting a specific user's jobs

Wed Sep 23 16:16:13 UTC 2020

I've used Paul's `MaxJobs` suggestion in emergencies with success.  +1 vote.

We've encountered RPC timeouts and were able to reduce or eliminate them during periods of high job flux by tuning two `SchedulerParameters` options: decreasing `sched_max_job_start` and increasing `sched_min_interval`.  Selecting good values required some trial and error.
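
For anyone wanting a starting point, here is a sketch of what that might look like in slurm.conf. The numbers are illustrative assumptions only, not the values we settled on. `sched_min_interval` is given in microseconds, and `sched_max_job_start` caps how many jobs the main scheduling loop starts in one pass (0 means no limit).

```
# slurm.conf fragment (example values only, tune for your site).
# Merge these with any options already present on your SchedulerParameters line.
# sched_max_job_start=100    : start at most 100 jobs per main scheduler pass
# sched_min_interval=2000000 : run the main scheduling loop at most every 2 seconds
SchedulerParameters=sched_max_job_start=100,sched_min_interval=2000000
```

After editing slurm.conf, `scontrol reconfigure` asks the daemons to re-read it.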

Good luck!

Sebastian

--

University of Nevada, Reno <http://www.unr.edu/>
Sebastian Smith
High-Performance Computing Engineer
Office of Information Technology
1664 North Virginia Street
MS 0291

work-phone: 775-682-5050
email: stsmith at unr.edu
website: http://rc.unr.edu

________________________________
From: slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf of Paul Edmon <pedmon at cfa.harvard.edu>
Sent: Tuesday, September 22, 2020 5:01 PM
To: slurm-users at lists.schedmd.com <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] Quickly throttling/limiting a specific user's jobs

I would look at:

MaxJobs=<max jobs>
Maximum number of jobs each user is allowed to run at one time in this association. This is overridden if set directly on a user. Default is the cluster's limit. To clear a previously set value use the modify command with a new value of -1.

This limit is association based, so you could just modify their account directly and set it to something low.
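
A minimal sketch of how that could be done with sacctmgr. The user name `alice` and the limit of 5 are placeholders, and the `run` wrapper with its `DRY_RUN` variable is a hypothetical helper added here so the commands are only printed unless you opt in to running them.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 on a host with Slurm admin rights to really run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Cap how many jobs a user may run at once.
# $1 = user name, $2 = limit (both placeholders in the call below).
limit_user_jobs() {
    run sacctmgr -i modify user where name="$1" set MaxJobs="$2"
}

# Clear the cap again; a value of -1 removes a previously set limit.
clear_user_job_limit() {
    run sacctmgr -i modify user where name="$1" set MaxJobs=-1
}

limit_user_jobs alice 5
```

The `-i` flag makes sacctmgr commit without asking for confirmation. Without further filters this applies to all of the user's associations; adding `cluster=` or `account=` to the `where` clause should narrow it.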

You can also simply put their pending jobs in hold state.  That way they won't be considered for scheduling but won't be outright removed.  Setting fairshare to 0 has the same effect.
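
Holding could be scripted roughly as follows. This is a sketch under assumptions: the job IDs are made up, `alice` is a placeholder user, and `run`, `join_ids`, `hold_jobs` and `release_jobs` are hypothetical helper names. By default it only prints the scontrol command.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Turn a newline-separated list of job IDs into a comma-separated one.
join_ids() {
    paste -sd, -
}

# Hold or release a comma-separated list of job IDs.
hold_jobs()    { run scontrol hold "$1"; }
release_jobs() { run scontrol release "$1"; }

# On a live cluster the IDs could come from squeue, for example:
#   ids=$(squeue -h -u alice -t PENDING -o %i | join_ids)
# Pending job arrays are printed in task-range form and may need separate handling.
ids=$(printf '1001\n1002\n1003\n' | join_ids)   # made-up job IDs

hold_jobs "$ids"
```

`release_jobs` with the same list undoes the hold once the pressure is off.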

-Paul Edmon-

On 9/22/2020 7:58 PM, Brian Andrus wrote:

Well, I know of no way to 'throttle' running jobs. Once they are out the gate, you can't stop them from leaving.

That said, your approach of setting ArrayTaskThrottle is just what you want for any pending jobs.
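
For reference, a sketch of changing the throttle on an already submitted array. The job ID 12345 and the limit of 2 are placeholders, and the `run` wrapper and `throttle_array` function are hypothetical helpers that only print the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Limit how many tasks of a job array may run at the same time.
# $1 = array job ID, $2 = maximum simultaneous tasks (0 removes the limit).
throttle_array() {
    run scontrol update JobId="$1" ArrayTaskThrottle="$2"
}

throttle_array 12345 2
```

This only affects tasks that have not started yet; tasks already running keep running.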

As a preventative measure, I imagine you could set the default to 1 and then change it with a job_submit script.

As far as currently running tasks go, well, you have to figure that out. You could kill/requeue them, but that can break things for the user. If their code supports it, they could checkpoint/restart as part of the process.

You can suspend them, but they still sit on the node waiting to be resumed, and the node resources may get assigned to other jobs while they wait.
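
The options for running jobs map onto scontrol roughly like this. It is a sketch only: the job IDs are made up, and `run`, `suspend_jobs`, `resume_jobs` and `requeue_jobs` are hypothetical helper names that print the command unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Each helper takes a comma-separated list of job IDs.
suspend_jobs() { run scontrol suspend "$1"; }   # pause in place on the node
resume_jobs()  { run scontrol resume "$1"; }    # continue suspended jobs
requeue_jobs() { run scontrol requeue "$1"; }   # stop and put back in the queue

suspend_jobs 1001,1002   # made-up job IDs
```

Requeue is the most disruptive of the three, since the job starts over unless the user's code can restart from a checkpoint.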

Brian Andrus

On 9/22/2020 2:33 PM, Ransom, Geoffrey M. wrote: