Hi Michael,
thanks for your explanation. I understand that setting
"MaxTRESMinsPerJob=cpu=172800" will allow (in my case):
- a job on the full cluster for 6 hours
- a job on half of the cluster for 12 hours
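(Either way that is the same budget, assuming the full cluster is 480 cores
as the 6 hour figure implies: 480 cores x 6 h x 60 = 172800 CPU-minutes, and
240 cores x 12 h x 60 = 172800 CPU-minutes.)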
But if I do not want the same user to run 2 jobs at the same time on half of
the cluster for 12 hours (and fill the cluster for a long time), how can I
limit his running jobs to 172800 CPU-minutes?
I was looking for something like "MaxTRESMinsPerUser" but could not find such
a limit.
Patrick
On 18/04/2025 at 17:17, Michael Gutteridge wrote:
Hi
I think you want one of the "MaxTRESMins*" options:
MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
Maximum number of TRES minutes each job is able to use in this
association. This is overridden if set directly on a user.
Default is the cluster's limit. To clear a previously set
value use the modify command with a new value of -1 for each
TRES id.
-
sacctmgr(1)
"MaxCPUs" is a limit on the number of CPUs the association can use.
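So if the intent is a CPU x time budget per job, something like this should
do it (untested sketch, reusing your QoS name; the cpu=-1 follows the
"-1 to clear" convention quoted above):

sacctmgr modify qos workflowlimit set MaxTRESPerJob=cpu=-1 MaxTRESMinsPerJob=cpu=172800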
--
Michael
Hi all,
I'm trying to set up a QoS on a small 5-node cluster running Slurm 24.05.7.
My goal is to limit resources with a (time x number of cores) strategy, to
avoid one large job requesting all the resources for too long. I've read
https://slurm.schedmd.com/qos.html and some discussions, but my setup is
still not working.
I think I need to set these parameters:
MaxCPUsPerJob=172800
MaxWallDurationPerJob=48:00:00
Flags=DenyOnLimit,OverPartQOS
for:
- 12h max for 240 cores => (12*240*60 = 172800 CPU-minutes)
- no job can exceed 2 days
- do not accept jobs outside these limits
What I've done:
1) create the QoS:
sudo sacctmgr add qos workflowlimit \
MaxWallDurationPerJob=48:00:00 \
MaxCPUsPerJob=172800 \
Flags=DenyOnLimit,OverPartQOS
2) Check
sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
            Name       MaxTRES     MaxWall
---------------- ------------- -----------
   workflowlimit    cpu=172800  2-00:00:00
3) Set the QoS for the account "most", which is the default account for the users:
sudo sacctmgr modify account name=most set qos=workflowlimit
4) Check
$ sacctmgr show assoc format=account,cluster,user,qos
   Account    Cluster       User                  QOS
---------- ---------- ---------- --------------------
      root     osorno                          normal
      root     osorno       root               normal
      legi     osorno                          normal
      most     osorno                   workflowlimit
      most     osorno      begou        workflowlimit
5) Modify slurm.conf with:
AccountingStorageEnforce=limits,qos
and propagate it to the 5 nodes and the front end (done via Ansible)
6) Check
clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
---------------
osorno,osorno-0-[0-4],osorno-fe (7)
---------------
AccountingStorageEnforce=limits,qos
7) Restart slurmd on all the compute nodes, and slurmctld + slurmdbd on the management node.
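Roughly like this, assuming the standard systemd unit names:
clush -w osorno-0-[0-4] 'sudo systemctl restart slurmd'
sudo systemctl restart slurmctld slurmdbd      # on the management node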
But I can still request 400 cores for 24 hours:
[begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
bash-5.1$ squeue
JOBID PARTITION  NAME   USER ST  TIME          START_TIME TIME_LIMIT CPUS NODELIST(REASON)
  147     genoa  bash  begou  R  0:03 2025-04-18T16:52:11 1-00:00:00  400 osorno-0-[0-4]
So I must have missed something?
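I suppose I could also check which QoS the job actually picked up, with
something like this (untested, assuming %q prints the QoS):
squeue -j 147 -o "%i %q %l %C"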
My partition (I have only one) in slurm.conf is:
PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]
Thanks
Patrick
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-leave@lists.schedmd.com