[slurm-users] Single user consuming all resources of the cluster
Matteo F
mfasco984 at gmail.com
Tue Feb 6 23:28:53 MST 2018
Thanks Bill, I really appreciate the time you spent giving this detailed
answer.
I will have a look at the plugin system, as integration with our
accounting system would be a nice feature.
@Chris thanks, I've had a look at GrpTRES but I'll probably go with the
Spank route.
Best,
Matteo
On 6 February 2018 at 13:58, Bill Barth <bbarth at tacc.utexas.edu> wrote:
> Chris probably gives the Slurm-iest way to do this, but we use a Spank
> plugin that counts the jobs that a user has in queue (running and waiting)
> and sets a hard cap on how many they can have. This should probably be
> scaled to the size of the system and the partition they are submitting to,
> but on Stampede 2 (4200 KNL nodes and 1736 SKX nodes), we set this, across
> all queues, to about 50, which has been our magic number across numerous
> schedulers over the years on systems ranging from hundreds of nodes to
> Stampede 1 with 6400. Some users get more by request and most don’t even
> bump up against the limits. We’ve started to look at using TRES on our test
> system, but we haven’t gotten there yet. Our use of the DB is minimal, and
> our process to get every user into it when their TACC account is created is
> not 100% automated yet (we use the job completion plugin to create a flat
> file with job records which our local accounting system consumes to
> decrement allocation balances, if you care to know).
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bbarth at tacc.utexas.edu | Phone: (512) 232-7069
> Office: ROC 1.435 | Fax: (512) 475-9445
>
>
>
> On 2/6/18, 6:03 AM, "slurm-users on behalf of Christopher Samuel" <
> slurm-users-bounces at lists.schedmd.com on behalf of chris at csamuel.org>
> wrote:
>
> On 06/02/18 21:40, Matteo F wrote:
>
> > I've tried to limit the number of running jobs using QoS ->
> > MaxJobsPerAccount, but this wouldn't stop a user from just filling up
> > the cluster with fewer (but bigger) jobs.
>
> You probably want to look at what you can do with the slurmdbd database
> and associations. Things like GrpTRES:
>
> https://slurm.schedmd.com/sacctmgr.html
>
> # GrpTRES=<TRES=max TRES,...>
> # Maximum number of TRES running jobs are able to be allocated in
> # aggregate for this association and all associations which are
> # children of this association. To clear a previously set value use the
> # modify command with a new value of -1 for each TRES id.
> #
> # NOTE: This limit only applies fully when using the Select Consumable
> # Resource plugin.
>
> Best of luck,
> Chris
>
>
>
>
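For anyone reading this thread later, here is a rough Python sketch of the two
kinds of limit discussed above. It is not Bill's Spank plugin (which would be
written in C against the Slurm plugin API) and it is not how Slurm implements
GrpTRES internally; it only models the policy logic. The function names
(may_submit, fits_group_tres), the data shapes, and the override table are all
hypothetical and made up for illustration. The default of 50 is the number Bill
quotes.

```python
from collections import Counter

# Bill's quoted "magic number"; scale it to your own system and partitions.
DEFAULT_MAX_JOBS = 50


def may_submit(user, queued_jobs, overrides=None, default_cap=DEFAULT_MAX_JOBS):
    """Hard cap on jobs per user, counting running AND waiting jobs.

    queued_jobs: iterable of (owner, state) tuples (hypothetical shape).
    overrides:   optional dict of user -> higher/lower cap granted by request.
    """
    overrides = overrides or {}
    counts = Counter(owner for owner, _state in queued_jobs)
    cap = overrides.get(user, default_cap)
    # Allow the submission only if the user is still below their cap.
    return counts[user] < cap


def fits_group_tres(running_alloc, request, limits):
    """Aggregate resource cap in the spirit of GrpTRES.

    running_alloc: list of dicts, one per running job in the association,
                   mapping a TRES name (e.g. "cpu") to the amount allocated.
    request:       dict of TRES the new job asks for.
    limits:        dict of TRES name -> aggregate maximum.
    """
    for tres, limit in limits.items():
        in_use = sum(job.get(tres, 0) for job in running_alloc)
        if in_use + request.get(tres, 0) > limit:
            return False  # starting this job would exceed the aggregate limit
    return True
```

The first check limits how many jobs a user can have in the queue; the second
limits the total resources held by running jobs, which is what stops a user
from filling the cluster with fewer but bigger jobs.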