Dear Team,
I have a scenario where I need to provide priority access to multiple users from different projects for only 3 nodes. This means that, at any given time, only 3 nodes can be used in that partition, and if one user is utilizing all 3 nodes, no other user should be able to submit jobs to that partition, or their jobs should remain in the queue.
To achieve this, I attempted to use QoS by creating a floating partition with some of the nodes and configuring a QoS with priority. I also set a limit with GrpTRES=gres/gpu=24, given that each node has 8 GPUs, and there are 3 nodes in total. I then attached the QoS to the partition and assigned it to the users who need access. I also tried MaxTRES=gres/gpu=24.
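In sketch form, the approach was roughly the following (the QoS and partition names, the priority value, and the node list are placeholders):

  sacctmgr add qos prio_qos
  sacctmgr modify qos prio_qos set priority=<value> GrpTRES=gres/gpu=24
  # floating partition in slurm.conf, restricted to that QoS:
  PartitionName=floatp Nodes=<subset of GPU nodes> AllowQos=prio_qos State=UP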
While this setup works as expected in the testing environment for CPUs, it is not functioning as intended in production, and it is not effectively restricting node usage in the partition. Could anyone provide suggestions or guidance on how to properly implement node restrictions along with priority?
Thank you for your assistance.
Best regards, Manisha Yadav
Manisha Yadav via slurm-users slurm-users@lists.schedmd.com writes:
To achieve this, I attempted to use QoS by creating a floating partition with some of the nodes and configuring a QoS with priority. I also set a limit with GrpTRES=gres/gpu=24, given that each node has 8 GPUs, and there are 3 nodes in total.
If there are more nodes with GPUs, this will not prevent these users from getting GPUs on more than 3 nodes; it will only prevent them from getting more than 24 GPUs. It will not prevent them from running CPU-only jobs on other nodes either. I think using GrpTRES=gres/gpu=24,node=3 (or perhaps simply GrpTRES=node=3) should work.
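For example, something along these lines (untested; substitute the name of the QoS you created):

  sacctmgr modify qos <your_qos> set GrpTRES=gres/gpu=24,node=3
  # or, if capping the node count alone is enough:
  sacctmgr modify qos <your_qos> set GrpTRES=node=3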
Hi,
Thanks for your valuable reply! Based on your input, I made the following changes to the configuration: I created a new QoS with priority 200, restricted to 3 nodes and 24 GPUs.
Here are the commands I used:

  sacctmgr add qos test
  sacctmgr modify qos test set priority=200
  sacctmgr modify qos test set GrpTRES=cpu=24
  sacctmgr modify qos test set GrpTRES=gres/gpu=24,node=3
Attached the QoS to users from different groups as their default QoS.
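For reference, attaching a QoS to a user and making it their default can be done with something like the following (the username is a placeholder):

  sacctmgr modify user where name=<username> set qos+=test defaultqos=test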
Created a floating partition with all the nodes from the default partition and attached the same QoS to this partition. The configuration is as follows:
PartitionName=testingp MaxTime=7-0:00:00 DefaultTime=01:00:00 AllowQos=test State=UP Nodes=node1,node2,node3,node4,node5,node5 DefCpuPerGPU=16 MaxCPUsPerNode=192
However, when the users submit their jobs to the testingp partition, they are not receiving the expected priority. Their jobs are stuck in the queue and are not being allocated resources, while users without any priority are able to get resources on the default partition.
Could you please confirm if my setup is correct, or if any modifications are required on my end? My Slurm version is 21.08.6.
Manisha Yadav manishay@cdac.in writes:
Could you please confirm if my setup is correct, or if any modifications are required on my end?
I don't see anything wrong with the part of the setup that you've shown.
Have you checked with `sprio -l -j <jobids>` whether the jobs get the extra QoS priority? If not, perhaps the multifactor priority plugin isn't in use, or the QoS weight is zero. See, e.g., https://slurm.schedmd.com/qos.html
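For example (the exact output format varies between Slurm versions):

  sprio -l -j <jobids>                     # the QOS column shows the QoS contribution
  scontrol show config | grep -i priority  # check PriorityType and PriorityWeightQOS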
My Slurm version is 21.08.6.
Oh, that is old. I'd seriously consider upgrading. For instance, this is too old to get security patches.