[slurm-users] Single user consuming all resources of the cluster

Matteo F mfasco984 at gmail.com
Tue Feb 6 23:28:53 MST 2018


Thanks Bill, I really appreciate the time you spent giving this detailed
answer.
I will have a look at the plugin system, as integration with our
accounting system would be a nice feature.

@Chris: thanks, I've had a look at GrpTRES, but I'll probably go with the
SPANK route. A rough sketch of the SPANK idea follows below, and a concrete
sacctmgr example for GrpTRES is at the end of this mail.

Best,
Matteo
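
P.S. For anyone following along, here is a rough, untested sketch of the
kind of SPANK plugin Bill describes below: it counts the submitting user's
pending and running jobs and refuses the submission once a hard cap is
reached. The plugin name, the cap value, and the fail-open behaviour are my
own assumptions, not TACC's actual implementation.

    /*
     * Rough sketch only: a SPANK plugin that refuses a new submission when
     * the user already has too many jobs pending or running.  Untested;
     * plugin name, cap, and fail-open behaviour are assumptions.
     */
    #include <stdint.h>
    #include <unistd.h>
    #include <slurm/slurm.h>
    #include <slurm/slurm_errno.h>
    #include <slurm/spank.h>

    SPANK_PLUGIN(job_cap, 1);

    /* The "magic number" from this thread; scale to your system. */
    static const uint32_t MAX_JOBS_PER_USER = 50;

    int slurm_spank_init(spank_t sp, int ac, char **av)
    {
        (void) sp; (void) ac; (void) av;

        /* Only act in the commands users run (sbatch/salloc/srun),
         * not inside slurmd/slurmstepd on the compute nodes. */
        if (spank_context() != S_CTX_ALLOCATOR &&
            spank_context() != S_CTX_LOCAL)
            return ESPANK_SUCCESS;

        job_info_msg_t *jobs = NULL;
        if (slurm_load_jobs((time_t) 0, &jobs, SHOW_ALL) != SLURM_SUCCESS)
            return ESPANK_SUCCESS;   /* controller unreachable: fail open */

        uid_t uid = getuid();
        uint32_t count = 0;
        for (uint32_t i = 0; i < jobs->record_count; i++) {
            job_info_t *j = &jobs->job_array[i];
            uint32_t state = j->job_state & JOB_STATE_BASE;
            if ((uid_t) j->user_id == uid &&
                (state == JOB_PENDING || state == JOB_RUNNING))
                count++;
        }
        slurm_free_job_info_msg(jobs);

        if (count >= MAX_JOBS_PER_USER) {
            slurm_error("You already have %u jobs queued or running "
                        "(limit %u); please wait for some to finish.",
                        count, MAX_JOBS_PER_USER);
            return ESPANK_ERROR;     /* should abort the submission */
        }
        return ESPANK_SUCCESS;
    }

Build it against the Slurm headers and list it in plugstack.conf; I would
also verify, for your Slurm version, that an error returned from
slurm_spank_init() in allocator context really does abort sbatch.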

On 6 February 2018 at 13:58, Bill Barth <bbarth at tacc.utexas.edu> wrote:

> Chris probably gives the Slurm-iest way to do this, but we use a SPANK
> plugin that counts the jobs a user has in the queue (running and waiting)
> and sets a hard cap on how many they can have. This should probably be
> scaled to the size of the system and the partition they are submitting to,
> but on Stampede 2 (4200 KNL nodes and 1736 SKX nodes) we set it, across
> all queues, to about 50. That has been our magic number across numerous
> schedulers over the years, on systems ranging from hundreds of nodes to
> Stampede 1 with 6400. Some users get more by request, and most don't even
> bump up against the limit. We've started to look at using TRES on our test
> system, but we haven't gotten there yet. Our use of the DB is minimal, and
> our process for getting every user into it when their TACC account is
> created is not 100% automated yet (we use the job completion plugin to
> write a flat file of job records, which our local accounting system
> consumes to decrement allocation balances, if you care to know).
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
> Office: ROC 1.435            |   Fax:   (512) 475-9445
>
>
>
> On 2/6/18, 6:03 AM, "slurm-users on behalf of Christopher Samuel" <
> slurm-users-bounces at lists.schedmd.com on behalf of chris at csamuel.org>
> wrote:
>
>     On 06/02/18 21:40, Matteo F wrote:
>
>     > I've tried to limit the number of running jobs using a QOS with
>     > MaxJobsPerAccount, but this wouldn't stop a user from just filling
>     > up the cluster with fewer (but bigger) jobs.
>
>     You probably want to look at what you can do with the slurmdbd database
>     and associations. Things like GrpTRES:
>
>     https://slurm.schedmd.com/sacctmgr.html
>
>     # GrpTRES=<TRES=max TRES,...>
>     #     Maximum number of TRES running jobs are able to be allocated in
>     #     aggregate for this association and all associations which are
>     #     children of this association. To clear a previously set value use
>     #     the modify command with a new value of -1 for each TRES id.
>     #
>     #     NOTE: This limit only applies fully when using the Select
>     #     Consumable Resource plugin.
>
>     Best of luck,
>     Chris
>
>
>
>
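
P.P.S. In case someone finds this thread later: if the association route
turns out to be the better fit, my understanding is that the GrpTRES limit
from the sacctmgr documentation quoted above would be set along these lines
(the account name and TRES values are made-up examples, adjust for your
site):

    sacctmgr modify account name=physics set GrpTRES=cpu=256,node=8

and cleared again by setting each TRES to -1, as the quoted description
explains.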