[slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Wed May 6 13:44:38 UTC 2020

Still have the same issue when I updated the user and qos..
Command I am using.
‘sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2’
I restarted the services. Unfortunately I am still have to saturate the cluster with jobs.

We have a cluster of 10 nodes each with 4 gpus, for a total of 40 gpus. Each node is identical in the software, OS, SLURM. etc.. I am trying to limit each user to only be able to use 2 out of 40 gpus across the entire cluster or partition. A intended bottle neck so no one can saturate the cluster..

I.E. desired outcome would be. Person A submits 100 jobs, 2 would run , and 98 would be pending, 38 gpus would be idle. Once the 2 running are finished, 2 more would run and 96 would be pending, still 38 gpus would be idle..

Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Tuesday, May 5, 2020 6:48 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

External Email
Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia

On Wed, 6 May 2020 at 04:53, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
UoM notice: External email. Be cautious of links, attachments, or impersonation attempts.
________________________________
Hey Killian,

I tried to limit the number of gpus a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the qos.. I restarted slurm control daemon and unfortunately I am still able to run on all the gpus in the partition. Any other ideas?

Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Killian Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition

External Email
Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. I.E:

We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit total number of allocated GPUs to 4, and set the GPU partition QoS to the `gpujobs` QoS.

There is a section in the Slurm documentation on the 'Resource Limits' page entitled 'QOS specific limits supported (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit setting with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read!

Killian

On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
Hi everyone,
First message, I am trying find a good way or multiple ways to limit the usage of jobs per node or use of gpus per node, without blocking a user from submitting them.

Example. We have 10 nodes each with 4 gpus in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. One job per gpu; thus we can hold a total of 40 jobs concurrently in the partition.
At the moment: each user usually submit 50- 100 jobs at once. Taking up all gpus, and all other users have to wait in pending..

What I am trying to setup is allow all users to submit as many jobs as they wish but only run on 1 out of the 4 gpus per node, or some number out of the total 40 gpus across the entire partition. Using slurm 18.08.3..

This is roughly our slurm scripts.

#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb                     # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00               # Time limit hrs:min:sec
#SBATCH --output=job _%j.log         # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high

srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"

Thomas Theis

--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20200506/f33f8b0e/attachment-0001.htm>