[slurm-users] Single user consuming all resources of the cluster

Fulcomer, Samuel samuel_fulcomer at brown.edu
Wed Feb 7 06:48:52 MST 2018


We use GrpTresRunMins for this, with the idea that it's OK for users to
occupy lots of resources with short-running jobs, but not so much with
long-running jobs.
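
A minimal sketch of setting such a limit with sacctmgr (the user, account,
and cpu-minute figure below are placeholders to adapt, not our actual
values):

# Cap the aggregate cpu-minutes of *running* jobs for one association.
# A generous cap lets lots of short jobs through while throttling anyone
# trying to hold a big slice of the cluster for a long time.
sacctmgr modify user where name=someuser account=someaccount \
    set GrpTRESRunMins=cpu=1000000

# Inspect the association afterwards to confirm the limit is in place.
sacctmgr show assoc where user=someuser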

On Wed, Feb 7, 2018 at 8:41 AM, Bill Barth <bbarth at tacc.utexas.edu> wrote:

> Of course, Matteo. Happy to help. Our job completion script is:
>
> #!/bin/bash
>
> OUTFILE=/var/log/slurm/tacc_jobs_completed
>
> # These fields are provided in the script's environment by the
> # jobcomp/script plugin.
> echo "$JOBID:$UID:$ACCOUNT:$BATCH:$START:$END:$SUBMIT:$PARTITION:$LIMIT:$JOBNAME:$JOBSTATE:$NODECNT:$PROCS" >> $OUTFILE
>
> exit 0
>
> and our config settings (from scontrol show config) are:
>
> JobCompLoc              = /etc/slurm/tacc_job_completion.sh
> JobCompType             = jobcomp/script
>
> Feel free to steal as much of that as you like, just update the lines and
> names to remove the “tacc” parts. This script needs to be present on the
> machine that slurmctld is running on. Our internal accounting system is
> RESTful, so we’re thinking of using this mechanism to write accounting
> records to it directly in this plugin script with curl/wget calls rather
> than appending to our flat file and shipping that info to our database via
> cron script nightly. That would give us the ability to do live updates of
> balances (which the Slurm DB already supports) to prevent overdrawn
> accounts. This is convoluted, but we have had to reinvent the wheel a
> little since we need to report usage to both our local accounting database
> and a national one. Yes, there were probably other ways to do this, but the
> infrastructure is now historical and set in some stone.
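>
> As a rough sketch (the endpoint URL and payload format below are
> placeholders, not our real accounting API), the curl-based variant of the
> completion script might look something like:
>
> #!/bin/bash
>
> # Hypothetical REST endpoint for the local accounting service.
> ACCOUNTING_URL="https://accounting.example.org/api/jobs"
>
> # Same fields as the flat-file record; all of these are provided in the
> # environment by the jobcomp/script plugin.
> RECORD="$JOBID:$UID:$ACCOUNT:$BATCH:$START:$END:$SUBMIT:$PARTITION:$LIMIT:$JOBNAME:$JOBSTATE:$NODECNT:$PROCS"
>
> # Keep the timeout short and just log failures, so a slow or unreachable
> # accounting service never holds anything up on the slurmctld host.
> curl --silent --show-error --fail --max-time 10 \
>      --data "record=$RECORD" "$ACCOUNTING_URL" \
>      || logger -t jobcomp "accounting POST failed for job $JOBID"
>
> exit 0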
>
> Best,
> Bill.
>
> --
> Bill Barth, Ph.D., Director, HPC
> bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
> Office: ROC 1.435            |   Fax:   (512) 475-9445
>
>
>
> On 2/7/18, 12:28 AM, "slurm-users on behalf of Matteo F" <
> slurm-users-bounces at lists.schedmd.com on behalf of mfasco984 at gmail.com>
> wrote:
>
>     Thanks Bill, I really appreciate the time you spent giving this
> detailed answer.
>     I will have a look at the plugin system, as the integration with our
> accounting system would be a nice feature.
>
>
>
>
>     @Chris thanks, I've had a look at GrpTRES but I'll probably go with
> the Spank route.
>
>
>     Best,
>     Matteo
>
>     On 6 February 2018 at 13:58, Bill Barth
>     <bbarth at tacc.utexas.edu> wrote:
>
>     Chris probably gives the Slurm-iest way to do this, but we use a Spank
> plugin that counts the jobs that a user has in queue (running and waiting)
> and sets a hard cap on how many they can have. This should probably be
> scaled to the size of the system and the partition they are submitting to,
> but on Stampede 2 (4200 KNL nodes and 1736 SKX nodes) we set this, across
> all queues, to about 50, which has been our magic number across numerous
> schedulers over the years on systems ranging from hundreds of nodes to
> Stampede 1 with 6400. Some users get more by request and most don’t even
> bump up against the limits. We’ve started to look at using TRES on our test
> system, but we haven’t gotten there yet. Our use of the DB is minimal, and
> our process to get every user into it when their TACC account is created is
> not 100% automated yet (we use the job completion plugin to create a flat
> file with job records, which our local accounting system consumes to
> decrement allocation balances, if you care to know).
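>
>     Purely as an illustration of the check (our real limit lives in a C
>     SPANK plugin, not a wrapper script), the counting logic amounts to
>     something like this sbatch wrapper:
>
>     #!/bin/bash
>     # Refuse a new submission if the user already has MAX_JOBS jobs
>     # pending or running. 50 is the "magic number" mentioned above.
>     MAX_JOBS=50
>     current=$(squeue -h -u "$USER" -t pending,running | wc -l)
>     if [ "$current" -ge "$MAX_JOBS" ]; then
>         echo "You already have $current jobs queued or running (limit: $MAX_JOBS)." >&2
>         exit 1
>     fi
>     exec sbatch "$@"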
>
>     Best,
>     Bill.
>
>     --
>     Bill Barth, Ph.D., Director, HPC
>     bbarth at tacc.utexas.edu        |   Phone: (512) 232-7069
>     Office: ROC 1.435            |   Fax:   (512) 475-9445
>
>
>
>     On 2/6/18, 6:03 AM, "slurm-users on behalf of Christopher Samuel" <
> slurm-users-bounces at lists.schedmd.com on behalf of
>     chris at csamuel.org> wrote:
>
>         On 06/02/18 21:40, Matteo F wrote:
>
>         > I've tried to limit the number of running jobs using QOS ->
>         > MaxJobsPerAccount, but this wouldn't stop a user from just
>         > filling up the cluster with fewer (but bigger) jobs.
>
>         You probably want to look at what you can do with the slurmdbd
> database
>         and associations. Things like GrpTRES:
>
>
>     https://slurm.schedmd.com/sacctmgr.html
>
>         # GrpTRES=<TRES=max TRES,...>
>         #     Maximum number of TRES running jobs are able to be allocated
>         #     in aggregate for this association and all associations which
>         #     are children of this association. To clear a previously set
>         #     value use the modify command with a new value of -1 for each
>         #     TRES id.
>         #
>         #     NOTE: This limit only applies fully when using the Select
>         #     Consumable Resource plugin.
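>
>         As a made-up example (the account name and numbers below are
>         placeholders), capping everything under one association at 256
>         CPUs and 16 nodes in aggregate would look something like:
>
>         # Set the aggregate limit on the account's association.
>         sacctmgr modify account where name=someaccount \
>             set GrpTRES=cpu=256,node=16
>
>         # As noted above, -1 clears a previously set value.
>         sacctmgr modify account where name=someaccount set GrpTRES=cpu=-1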
>
>         Best of luck,
>         Chris
>

