Dear All,
I tried to implement a strict limit on GrpTRESMins for each user. The effect I'm trying to achieve is that once the limit of GPU minutes is reached, no new jobs can start: no decay, no automatic resource replenishment. After the GPU-minute limit is reached, each user should have to ask for more minutes. But despite exceeding the limits, users *can* still run new jobs.
* When adding a user to the cluster, I set the following (the commands for double-checking both settings are shown after this list):
sacctmgr --immediate add user name=... ... QOS=2gpu2d GrpTRESMins=gres/gpu=20000
* In slurm.conf ("safe" means the limits and associations options are set automatically); storage is MariaDB via slurmdbd:
GresTypes=gpu
AccountingStorageTRES=gres/gpu
AccountingStorageEnforce=qos,safe
# This disables GPU-minute replenishing.
PriorityDecayHalfLife=0
PriorityUsageResetPeriod=NONE
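Both settings can be double-checked on the running cluster, roughly like this (user name and format fields here are only placeholders):

  sacctmgr show assoc where user=redacted format=account,user,partition,qos,grptresmins
  scontrol show config | grep -i AccountingStorageEnforce

The first command should report the stored GrpTRESMins value for the association, the second the enforcement flags the controller is actually running with.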
But when I look at the user's account info and usage, I can see that the limits are not enforced.
   Account             User    Partition          QOS          GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
  redacted         redacted        a6000       2gpu2d       gres/gpu=10000
--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-17T19:59:59 (1108800 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
       Login     Used        TRES Name
------------ -------- ----------------
    redacted   184311         gres/gpu
    redacted  1558558              cpu
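(That usage report comes from sreport's "user top" report; roughly something like the command below, with the dates and exact options only indicative:

  sreport user top start=2024-01-05 end=2024-01-18 TopCount=1 -t minutes --tres=gres/gpu,cpu
)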
Could someone explain where the problem could be? Am I missing something? Apparently yes :)
Kind regards