[slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

Theis, Thomas Thomas.Theis at Teledyne.com
Thu May 7 19:29:45 UTC 2020


Hello Killian,

Unfortunately, after setting the partition configuration to include the QoS, restarting the service, and verifying with sacctmgr, I still have the same issue.


Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Killian Murphy
Sent: Thursday, May 7, 2020 1:41 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

External Email
Hi Thomas.

With that partition configuration, I suspect jobs are going through the partition without the QoS 'normal', which restricts the number of GPUs per user.

You may find that reconfiguring the partition to have a QoS of 'normal' will result in the GPU limit being applied, as intended. This is set in the partition configuration in slurm.conf.
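As a rough sketch (the node list is abbreviated and any other options shown are just illustrative, not taken from your config), the PART1 definition in slurm.conf would gain a QOS entry along these lines:

PartitionName=PART1 Nodes=node1,node2,... AllowGroups=trace_unix_group QOS=normal

followed by an 'scontrol reconfigure' so that slurmctld should pick up the change.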

Killian
On Thu, 7 May 2020 at 18:25, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
Here are the outputs:
sacctmgr show qos -p

Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GrpSubmit|GrpWall|MaxTRES|MaxTRESPerNode|MaxTRESMins|MaxWall|MaxTRESPU|MaxJobsPU|MaxSubmitPU|MaxTRESPA|MaxJobsPA|MaxSubmitPA|MinTRES|
normal|10000|00:00:00||cluster|||1.000000|gres/gpu=2||||||||||gres/gpu=2|||||||
now|1000000|00:00:00||cluster|||1.000000||||||||||||||||||
high|100000|00:00:00||cluster|||1.000000||||||||||||||||||

scontrol show part

PartitionName=PART1
   AllowGroups=trace_unix_group AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node1,node2,node3,node4,….   PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=236 TotalNodes=11 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Sean Crosby
Sent: Wednesday, May 6, 2020 6:22 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

External Email
Do you have other limits set? QoS limits are hierarchical, and in particular a partition QoS can override other QoS limits.

What's the output of

sacctmgr show qos -p

and

scontrol show part

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 6 May 2020 at 23:44, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
UoM notice: External email. Be cautious of links, attachments, or impersonation attempts.
________________________________
I still have the same issue after updating the user and the QoS.
The command I am using:
‘sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2’
I restarted the services. Unfortunately, I am still able to saturate the cluster with jobs.

We have a cluster of 10 nodes, each with 4 GPUs, for a total of 40 GPUs. Each node is identical in software, OS, Slurm version, etc. I am trying to limit each user to 2 out of the 40 GPUs across the entire cluster or partition: an intentional bottleneck so that no one can saturate the cluster.

I.e., the desired outcome would be: Person A submits 100 jobs; 2 would run, 98 would be pending, and 38 GPUs would be idle. Once the 2 running jobs finish, 2 more would run, 96 would be pending, and 38 GPUs would still be idle.



Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Sean Crosby
Sent: Tuesday, May 5, 2020 6:48 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition

External Email
Hi Thomas,

That value should be

sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
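You can then double-check that the limit is actually stored with something like:

sacctmgr show qos gpujobs format=Name,MaxTRESPerUser

(the QoS name 'gpujobs' here is just the one from Killian's example; substitute whichever QoS is attached to your partition).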

Sean

--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia


On Wed, 6 May 2020 at 04:53, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
UoM notice: External email. Be cautious of links, attachments, or impersonation attempts.
________________________________
Hey Killian,

I tried to limit the number of GPUs a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the QoS. I restarted the slurm control daemon, and unfortunately I am still able to run on all the GPUs in the partition. Any other ideas?

Thomas Theis

From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Killian Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition

External Email
Hi Thomas.

We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition, i.e.:

We have a QoS `gpujobs` that sets MaxTRESPerUser=gres/gpu=4 to limit the total number of GPUs allocated per user to 4, and we set the GPU partition's QoS to the `gpujobs` QoS.
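Roughly, and treating the names below as illustrative rather than copied from our config, the setup looks like:

sacctmgr add qos gpujobs
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4

and then, on the GPU partition definition in slurm.conf:

PartitionName=gpu Nodes=gpu[01-10] QOS=gpujobs

with an 'scontrol reconfigure' afterwards so the partition QoS takes effect.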

There is a section on the 'Resource Limits' page of the Slurm documentation entitled 'QOS specific limits supported' (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit with typed GRES. Although it seems like you are working with generic GRES, it's worth a read!

Killian



On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
Hi everyone,
First message: I am trying to find a good way (or several ways) to limit the number of jobs per node, or the use of GPUs per node, without blocking users from submitting jobs.

Example: we have 10 nodes, each with 4 GPUs, in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. One job per GPU, so we can run a total of 40 jobs concurrently in the partition.
At the moment, each user usually submits 50-100 jobs at once, taking up all the GPUs, so all other users have to wait in pending.

What I am trying to set up is to allow all users to submit as many jobs as they wish, but to run on only 1 of the 4 GPUs per node, or on some number of the 40 GPUs across the entire partition. We are using Slurm 18.08.3.

This is roughly our Slurm script:

#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb                     # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00               # Time limit hrs:min:sec
#SBATCH --output=job_%j.log          # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high

srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"

Thomas Theis



--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm


--
Killian Murphy
Research Software Engineer

Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753

e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm