[slurm-users] Limit the number of GPUS per user per partition

Killian Murphy killian.murphy at york.ac.uk
Thu Apr 23 17:32:54 UTC 2020


Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a
partition through the MaxTRESPerUser field of a QOS for GPU jobs, which is
set as the partition QOS on our GPU partition. I.e.:

We have a QOS `gpujobs` that sets MaxTRESPerUser=gres/gpu:4 to limit the
total number of GPUs allocated to any one user to 4, and we set the GPU
partition's QOS to the `gpujobs` QOS.
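
For illustration, here is a rough sketch of that setup (the partition name
and node list are placeholders for your own, and the exact syntax is worth
double-checking against your Slurm version):

    # create the QOS and cap the GPUs any one user can hold at once
    sacctmgr add qos gpujobs
    sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

    # in slurm.conf, attach it as the partition QOS, then re-read the config
    #   PartitionName=PART1 Nodes=... QOS=gpujobs
    scontrol reconfigure

With the QOS attached as the partition QOS, the limit applies to every job
submitted to that partition, regardless of which QOS the job itself requests.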

There is a section on the 'Resource Limits' page of the Slurm documentation
entitled 'QOS specific limits supported'
(https://slurm.schedmd.com/resource_limits.html) that details some care
needed when using this kind of limit with typed GRES. Although it looks
like you are working with generic GRES, it's worth a read!
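
(Purely as an unverified illustration of the kind of thing that section
covers: with typed GRES such as gres/gpu:tesla, the limit may also need to
be expressed against the typed TRES, along the lines of

    sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4,gres/gpu:tesla=4

but do check the linked page for the exact behaviour on your Slurm version.)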

Killian



On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <Thomas.Theis at teledyne.com>
wrote:

> Hi everyone,
>
> First message: I am trying to find a good way, or multiple ways, to limit
> the number of jobs per node or the number of GPUs used per node, without
> blocking a user from submitting them.
>
>
>
> Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team
> of 6 people to submit jobs to any or all of the nodes. One job per GPU;
> thus we can hold a total of 40 jobs concurrently in the partition.
>
> At the moment, each user usually submits 50-100 jobs at once, taking up
> all the GPUs, and all other users have to wait in pending.
>
>
>
> What I am trying to set up is to allow all users to submit as many jobs as
> they wish, but have each user only run on 1 of the 4 GPUs per node, or on
> some number of the total 40 GPUs across the entire partition. We are using
> Slurm 18.08.3.
>
>
>
> This is roughly our Slurm script:
>
>
>
> #SBATCH --job-name=Name # Job name
>
> #SBATCH --mem=5gb                     # Job memory request
>
> #SBATCH --ntasks=1
>
> #SBATCH --gres=gpu:1
>
> #SBATCH --partition=PART1
>
> #SBATCH --time=200:00:00               # Time limit hrs:min:sec
>
> #SBATCH --output=job_%j.log          # Standard output and error log
>
> #SBATCH --nodes=1
>
> #SBATCH --qos=high
>
>
>
> srun -n1 --gres=gpu:1 --exclusive --export=ALL \
>     bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm \
>     -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT \
>     --name $SLURM_JOB_ID do_job.sh"
>
>
>
> *Thomas Theis*
>
>
>


-- 
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm