[slurm-users] Quickly throttling/limiting a specific user's jobs
toomuchit at gmail.com
Tue Sep 22 23:58:41 UTC 2020
Well, I know of no way to 'throttle' running jobs. Once they are out of
the gate, you can't stop them from leaving.
That said, your approach of setting ArrayTaskThrottle is just what you
want for any pending jobs.
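
For the pending array jobs, the per-job edits can be generated in one
pass instead of typed by hand. Below is a minimal sketch in Python; the
helper name throttle_commands and the sample job IDs are made up for
illustration, and it assumes the IDs were collected beforehand (for
example from squeue output for that user's pending jobs). It only builds
the command lines and prints them, so nothing touches slurmctld until
you decide to run them.

```python
import shlex


def throttle_commands(array_job_ids, limit=5):
    """Build one `scontrol update` command per distinct array job.

    array_job_ids: strings as squeue prints them, e.g. "1001_[5-200%100]",
    "1001_7" or plain "1002". Only the base job ID before "_" is used.
    limit: value to set for ArrayTaskThrottle (hypothetical default of 5).
    """
    bases = []
    for job_id in array_job_ids:
        base = job_id.split("_")[0].strip()
        if base and base not in bases:  # skip blanks and duplicates, keep order
            bases.append(base)
    return [
        ["scontrol", "update", f"JobId={base}", f"ArrayTaskThrottle={limit}"]
        for base in bases
    ]


if __name__ == "__main__":
    # Dry run: print the commands rather than executing them.
    for cmd in throttle_commands(["1001_[5-200%100]", "1001_7", "1002"]):
        print(shlex.join(cmd))
```

Printing first and running the commands in small batches may be kinder
to a slurmctld that is already behind on RPCs.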
As a preventative measure, I imagine you could set the default to 1 and
then change it with a job_submit script.
As for tasks that are already running, you have to work out what to do
with them yourself. You
could kill/requeue them, but that can break things for the user. If
their code supports it, they could checkpoint/restart as part of the
You can suspend them, but they still sit on the node waiting to be
resumed, and the node's resources may get assigned to other jobs while
they wait.
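
If suspending is the route you take, a small sketch like the one below
can build the suspend and matching resume commands for a batch of
running jobs. The function names and job IDs are hypothetical, and it
assumes you run the resulting commands as a user with the privileges
scontrol needs for suspend/resume.

```python
import shlex


def suspend_command(job_ids):
    """Build a single `scontrol suspend` command for a list of job IDs."""
    return ["scontrol", "suspend", ",".join(str(j) for j in job_ids)]


def resume_command(job_ids):
    """Build the matching `scontrol resume` command for the same jobs."""
    return ["scontrol", "resume", ",".join(str(j) for j in job_ids)]


if __name__ == "__main__":
    running = [2001, 2002, 2003]  # hypothetical running job IDs
    print(shlex.join(suspend_command(running)))
    print(shlex.join(resume_command(running)))
```

Keeping the same list for both calls makes it easy to resume exactly the
jobs that were suspended.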
On 9/22/2020 2:33 PM, Ransom, Geoffrey M. wrote:
> We had a user post a large number of array jobs with a short actual
> run time (20-80 seconds, but mostly to the low end) and slurmctld was
> falling behind on RPC calls trying to handle the jobs. It was a bit
> awkward trying to slap arraytaskthrottle=5 on each of the queued array
> jobs while slurmctld was having issues handling the RPC load.
> I’m looking to make a QOS with MaxJobsPerUser=50 set that I can
> quickly add to a user to throttle their jobs but..
> 1)Adding a QOS to the user does not affect queued jobs so I still have
> to get all of the user's job IDs and modify each one directly.
> 2)I queued up a test job with the QOS set and it is still running 100
> jobs at a time (what I set arraytaskthrottle to in the job) and not
> limiting the “user” to 50 jobs.
> 3)I tried adding the FLAG OverPartQOS to see if that changed the
> behavior, but it did not seem to do anything. My test cluster I ran
> this on doesn’t have any other QOSes defined but our production
> cluster does have a partition QOS in place limiting single users to
> about 80% of the CPUs with MaxTRESPerUser.
> Is there a quick way to limit how many jobs a specific user can run at
> one time on the cluster or in a partition if we need to throttle them
> back in an emergency but we don’t want to flat out kill their jobs?