[slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
Theis, Thomas
Thomas.Theis at Teledyne.com
Thu May 7 17:15:35 UTC 2020
Here is the outputs
sacctmgr show qos –p
Name|Priority|GraceTime|Preempt|PreemptMode|Flags|UsageThres|UsageFactor|GrpTRES|GrpTRESMins|GrpTRESRunMins|GrpJobs|GtPA|MinTRES|
normal|10000|00:00:00||cluster|||1.000000|gres/gpu=2||||||||||gres/gpu=2|||||||
now|1000000|00:00:00||cluster|||1.000000||||||||||||||||||
high|100000|00:00:00||cluster|||1.000000||||||||||||||||||
scontrol show part
PartitionName=PART1
AllowGroups=trace_unix_group AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node1,node2,node3,node4,…. PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=236 TotalNodes=11 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
Thomas Theis
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Sean Crosby
Sent: Wednesday, May 6, 2020 6:22 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
External Email
Do you have other limits set? The QoS is hierarchical, and especially partition QoS can override other QoS.
What's the output of
sacctmgr show qos -p
and
scontrol show part
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 6 May 2020 at 23:44, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
UoM notice: External email. Be cautious of links, attachments, or impersonation attempts.
________________________________
Still have the same issue when I updated the user and qos..
Command I am using.
‘sacctmgr modify qos normal set MaxTRESPerUser=gres/gpu=2’
I restarted the services. Unfortunately I am still have to saturate the cluster with jobs.
We have a cluster of 10 nodes each with 4 gpus, for a total of 40 gpus. Each node is identical in the software, OS, SLURM. etc.. I am trying to limit each user to only be able to use 2 out of 40 gpus across the entire cluster or partition. A intended bottle neck so no one can saturate the cluster..
I.E. desired outcome would be. Person A submits 100 jobs, 2 would run , and 98 would be pending, 38 gpus would be idle. Once the 2 running are finished, 2 more would run and 96 would be pending, still 38 gpus would be idle..
Thomas Theis
From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Sean Crosby
Sent: Tuesday, May 5, 2020 6:48 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] [EXT] Re: Limit the number of GPUS per user per partition
External Email
Hi Thomas,
That value should be
sacctmgr modify qos gpujobs set MaxTRESPerUser=gres/gpu=4
Sean
--
Sean Crosby | Senior DevOpsHPC Engineer and HPC Team Lead
Research Computing Services | Business Services
The University of Melbourne, Victoria 3010 Australia
On Wed, 6 May 2020 at 04:53, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
UoM notice: External email. Be cautious of links, attachments, or impersonation attempts.
________________________________
Hey Killian,
I tried to limit the number of gpus a user can run on at a time by adding MaxTRESPerUser = gres:gpu4 to both the user and the qos.. I restarted slurm control daemon and unfortunately I am still able to run on all the gpus in the partition. Any other ideas?
Thomas Theis
From: slurm-users <slurm-users-bounces at lists.schedmd.com<mailto:slurm-users-bounces at lists.schedmd.com>> On Behalf Of Killian Murphy
Sent: Thursday, April 23, 2020 1:33 PM
To: Slurm User Community List <slurm-users at lists.schedmd.com<mailto:slurm-users at lists.schedmd.com>>
Subject: Re: [slurm-users] Limit the number of GPUS per user per partition
External Email
Hi Thomas.
We limit the maximum number of GPUs a user can have allocated in a partition through the MaxTRESPerUser field of a QoS for GPU jobs, which is set as the partition QoS on our GPU partition. I.E:
We have a QOS `gpujobs` that sets MaxTRESPerUser => gres/gpu:4 to limit total number of allocated GPUs to 4, and set the GPU partition QoS to the `gpujobs` QoS.
There is a section in the Slurm documentation on the 'Resource Limits' page entitled 'QOS specific limits supported (https://slurm.schedmd.com/resource_limits.html) that details some care needed when using this kind of limit setting with typed GRES. Although it seems like you are trying to do something with generic GRES, it's worth a read!
Killian
On Thu, 23 Apr 2020 at 18:19, Theis, Thomas <Thomas.Theis at teledyne.com<mailto:Thomas.Theis at teledyne.com>> wrote:
Hi everyone,
First message, I am trying find a good way or multiple ways to limit the usage of jobs per node or use of gpus per node, without blocking a user from submitting them.
Example. We have 10 nodes each with 4 gpus in a partition. We allow a team of 6 people to submit jobs to any or all of the nodes. One job per gpu; thus we can hold a total of 40 jobs concurrently in the partition.
At the moment: each user usually submit 50- 100 jobs at once. Taking up all gpus, and all other users have to wait in pending..
What I am trying to setup is allow all users to submit as many jobs as they wish but only run on 1 out of the 4 gpus per node, or some number out of the total 40 gpus across the entire partition. Using slurm 18.08.3..
This is roughly our slurm scripts.
#SBATCH --job-name=Name # Job name
#SBATCH --mem=5gb # Job memory request
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=PART1
#SBATCH --time=200:00:00 # Time limit hrs:min:sec
#SBATCH --output=job _%j.log # Standard output and error log
#SBATCH --nodes=1
#SBATCH --qos=high
srun -n1 --gres=gpu:1 --exclusive --export=ALL bash -c "NV_GPU=$SLURM_JOB_GPUS nvidia-docker run --rm -e SLURM_JOB_ID=$SLURM_JOB_ID -e SLURM_OUTPUT=$SLURM_OUTPUT --name $SLURM_JOB_ID do_job.sh"
Thomas Theis
--
Killian Murphy
Research Software Engineer
Wolfson Atmospheric Chemistry Laboratories
University of York
Heslington
York
YO10 5DD
+44 (0)1904 32 4753
e-mail disclaimer: http://www.york.ac.uk/docs/disclaimer/email.htm
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20200507/e27932ba/attachment-0001.htm>
More information about the slurm-users
mailing list