Hi Michael,
Thanks for your explanation. I understand that setting "MaxTRESMinsPerJob=cpu=172800" will allow (in my case):
- a job on the full cluster for 6 hours
- a job on half of the cluster for 12 hours
But if I do not want the same user to run, at the same time, two jobs on half of the cluster for 12 hours each (and so fill the cluster for a long time), how can I limit his running jobs to 172800 CPU-minutes in total? I was looking for something like "MaxTRESMinsPerUser" but could not find such a limit.
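To make it concrete, the kind of setting I was hoping for would look something like the sketch below. The option name "MaxTRESRunMinsPerUser" is just my guess at the closest existing limit, and I am not sure it exists or counts running jobs the way I want:

  # sketch only: cap the CPU-minutes of one user's running jobs, all together, at 172800
  sudo sacctmgr modify qos name=workflowlimit set MaxTRESRunMinsPerUser=cpu=172800

With something like that, one half-cluster job for 12 hours (240 cores * 12 h * 60 = 172800 CPU-minutes) would still fit, but a second concurrent one would not.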
Patrick
On 18/04/2025 at 17:17, Michael Gutteridge wrote:
Hi
I think you want one of the "MaxTRESMins*" options:
MaxTRESMins=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPJ=TRES=<minutes>[,TRES=<minutes>,...]
MaxTRESMinsPerJob=TRES=<minutes>[,TRES=<minutes>,...]
    Maximum number of TRES minutes each job is able to use in this association.
    This is overridden if set directly on a user. Default is the cluster's limit.
    To clear a previously set value use the modify command with a new value of -1
    for each TRES id.
- sacctmgr(1)
The "MaxCPUs" is a limit on the number of CPUs the association can use.
-- Michael
On Fri, Apr 18, 2025 at 8:01 AM Patrick Begou via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hi all,

I'm trying to set up a QoS on a small 5-node cluster running Slurm 24.05.7. My goal is to limit resources with a (time x number of cores) strategy, to avoid one large job requesting all the resources for too long. I've read https://slurm.schedmd.com/qos.html and some discussions, but my setup is still not working.

I think I need to set:

MaxCPUsPerJob=172800
MaxWallDurationPerJob=48:00:00
Flags=DenyOnLimit,OverPartQOS

so that:
- 12h max for 240 cores => 12*240*60 = 172800 CPU-minutes
- no job can exceed 2 days
- jobs outside these limits are rejected

What I've done:

1) Create the QoS:

sudo sacctmgr add qos workflowlimit \
     MaxWallDurationPerJob=48:00:00 \
     MaxCPUsPerJob=172800 \
     Flags=DenyOnLimit,OverPartQOS

2) Check:

sacctmgr show qos Name=workflowlimit format=Name%16,MaxTRES,MaxWall
            Name       MaxTRES     MaxWall
---------------- ------------- -----------
   workflowlimit    cpu=172800  2-00:00:00

3) Set the QoS for the account "most", which is the default account for the users:

sudo sacctmgr modify account name=most set qos=workflowlimit

4) Check:

$ sacctmgr show assoc format=account,cluster,user,qos
   Account    Cluster       User                  QOS
---------- ---------- ---------- --------------------
      root     osorno                          normal
      root     osorno       root               normal
      legi     osorno                          normal
      most     osorno                   workflowlimit
      most     osorno      begou        workflowlimit

5) Modify slurm.conf with:

AccountingStorageEnforce=limits,qos

and propagate it to the 5 nodes and the front end (done via Ansible).

6) Check:

clush -b -w osorno-fe,osorno,osorno-0-[0-4] 'grep AccountingStorageEnforce /etc/slurm/slurm.conf'
---------------
osorno,osorno-0-[0-4],osorno-fe (7)
---------------
AccountingStorageEnforce=limits,qos

7) Restart slurmd on all the compute nodes and slurmctld + slurmdbd on the management node.

But I can still request 400 cores for 24 hours (see the P.S. below for the arithmetic):

[begou@osorno ~]$ srun -n 400 -t 24:0:0 --pty bash
bash-5.1$ squeue
 JOBID PARTITION  NAME   USER ST  TIME          START_TIME  TIME_LIMIT CPUS NODELIST(REASON)
   147     genoa  bash  begou  R  0:03 2025-04-18T16:52:11  1-00:00:00  400 osorno-0-[0-4]

So I must have missed something?

My partition (I have only one) in slurm.conf is:

PartitionName=genoa State=UP Default=YES MaxTime=48:00:00 DefaultTime=24:00:00 Shared=YES OverSubscribe=NO Nodes=osorno-0-[0-4]

Thanks

Patrick
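P.S. To put numbers on the srun example above (my own arithmetic, just to show the gap):

  requested:               400 cores * 24 h * 60 = 576000 CPU-minutes
  intended per-job limit:                          172800 CPU-minutes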