[slurm-users] Q about setting up CPU limits
Carsten Beyer
beyer at dkrz.de
Thu Sep 23 11:18:46 UTC 2021
Hi Dj,
One solution could be to use two QOS. We use something similar to restrict
usage of our GPU nodes (MaxTRESPU=node=2). The examples below are from our
test cluster.
1) Create a QOS with e.g. MaxTRESPU=cpu=200 and assign it to your
partition (the test cluster example below uses cpu=10), e.g.
[root at bta0 ~]# sacctmgr -s show qos maxcpu format=Name,MaxTRESPU
      Name     MaxTRESPU
---------- -------------
    maxcpu        cpu=10
[root at bta0 ~]#
[root at bta0 ~]# scontrol show part maxtresputest
PartitionName=maxtresputest
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=maxcpu
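For completeness, a rough sketch of how such a QOS could be created and
attached to the partition (not copied verbatim from our setup; the Nodes=
value below is just a placeholder):

sacctmgr add qos maxcpu
sacctmgr modify qos maxcpu set MaxTRESPerUser=cpu=200

# in slurm.conf, followed by 'scontrol reconfigure':
PartitionName=maxtresputest Nodes=btc[01-30] Default=NO QOS=maxcpu State=UP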
If a user submits jobs requesting more CPUs than the limit allows, the new
jobs stay pending with reason 'QOSMaxCpuPerUserLimit' in squeue:
kxxxxxx at btlogin1% squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            125316 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
            125317 maxtrespu maxsubmi  kxxxxxx PD       0:00      1 (QOSMaxCpuPerUserLimit)
            125305 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30
            125306 maxtrespu maxsubmi  kxxxxxx  R       0:45      1 btc30
2) Create a second QOS with Flags=DenyOnLimit,OverPartQOS and
MaxTRESPU=cpu=400. Assign it to a user who should be allowed to exceed the
limit of 200 CPUs; that user is then limited to 400 instead. The user has
to request this QOS explicitly when submitting new jobs, e.g.
[root at bta0 ~]# sacctmgr -s show qos overpart format=Name,Flags%30,MaxTRESPU
      Name                          Flags     MaxTRESPU
---------- ------------------------------ -------------
  overpart        DenyOnLimit,OverPartQOS        cpu=40
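A similar sketch for this second QOS (the user name and job script are
placeholders; our test cluster uses cpu=40 instead of 400):

sacctmgr add qos overpart
sacctmgr modify qos overpart set Flags=DenyOnLimit,OverPartQOS MaxTRESPerUser=cpu=400
sacctmgr modify user name=kxxxxxx set QOS+=overpart

# the user then has to request it at submit time:
sbatch --qos=overpart --partition=maxtresputest jobscript.sh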
Cheers,
Carsten
--
Carsten Beyer
Abteilung Systeme
Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany
Phone: +49 40 460094-221
Fax: +49 40 460094-270
Email: beyer at dkrz.de
URL: http://www.dkrz.de
Managing Director: Prof. Dr. Thomas Ludwig
Registered office: Hamburg
Commercial register: Amtsgericht Hamburg, HRB 39784
On 22.09.2021 at 20:57, Dj Merrill wrote:
> Hi all,
>
> I'm relatively new to Slurm and my Internet searches so far have
> turned up lots of examples from the client perspective, but not from
> the admin perspective on how to set this up, and I'm hoping someone
> can point us in the right direction. This should be pretty simple...
> :-)
>
> We have a test cluster running Slurm 21.08.1 and are trying to figure
> out how to set a limit of 200 CPU cores that can be requested in a
> partition. Basically, if someone submits a thousand single-CPU-core
> jobs, it should run 200 of them and the other 800 should wait in the
> queue until one finishes, then the next job from the queue should run,
> and so on. Or if someone has a 180 CPU core job running and they submit
> a 30 CPU core job, it should wait in the queue until the 180 core job
> finishes. If someone submits a job requesting 201 CPU cores, it should
> fail with an error.
>
> According to the Slurm resource limits hierarchy, if a partition limit
> is set, we should be able to set up a user association to override it
> in the case where we might want someone to be able to access 300 CPU
> cores in that partition, for example.
>
> I can see in the Slurm documentation how to set up a max nodes limit per
> partition, but have not been able to find how to do this with CPU cores.
>
> My questions are:
>
> 1) How do we set up a CPU core limit on a partition that applies to all
> users?
>
> 2) How do we set up a user association to allow a single person to use
> more than the default CPU core limit set on the partition?
>
> 3) Is there a better way to accomplish this than the method I'm asking?
>
>
> For reference, Slurm accounting is set up, GPU allocations are working
> properly, and I think we are close but just missing something obvious
> to set up the CPU core limits.
>
>
> Thank you,
>
>
> -Dj
>