Hello,
I’m running some tests with “associations” using “sacctmgr”. I have created three users (user_1, user_2 and user_3), and for each of them I have created associations:
[root@myserver log]# sacctmgr show user user_1 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_1 test None q50004 test aolin.q 1 4 2 10 normal
user_1 test None q50004 test cuda-staf+ 1 4 2 10 normal
[root@myserver log]# sacctmgr show user user_2 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_2 test None q50004 test cuda-int.q 1 4 normal
[root@myserver log]# sacctmgr show user user_3 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_3 test None q50004 test research.q 1 2 1 normal
user_3 test None q50004 test xeon.q 1 2 1 normal
All users belong to the “test” account:
[root@myserver log]# sacctmgr show account test --association
Account Descr Org Cluster ParentName User Share Priority GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
test test test q50004 root 1 normal
test test test q50004 user_1 1 4 2 10 normal
test test test q50004 user_1 1 4 2 10 normal
test test test q50004 user_2 1 4 normal
test test test q50004 user_3 1 2 1 normal
test test test q50004 user_3 1 2 1 normal
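Because empty fields disappear when this output is pasted, it is hard to tell which limit each number belongs to. A pipe-separated listing with an explicit field list might make that clearer. This is only a sketch: `show_user_limits` and `DRY_RUN` are hypothetical names of mine, and the field names are copied from the column headers shown here, so they may be different on other Slurm versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): list one user's association
# limits in pipe-separated form so that empty columns stay visible.
# Field names are taken from the column headers in the pasted output;
# they are an assumption for other Slurm versions.
show_user_limits() {
    user="$1"
    fields="User,Account,Partition,MaxJobs,MaxNodes,MaxCPUs,MaxSubmit"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only print the command that would be executed.
        printf '%s\n' "sacctmgr -P show associations where user=$user format=$fields"
    else
        sacctmgr -P show associations where "user=$user" "format=$fields"
    fi
}
```

Calling `show_user_limits user_3` on the cluster would run the query; with `DRY_RUN=1` set it only prints the command.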
When I submit as “user_1”, all tests run fine: some jobs are queued and executed, and some jobs are rejected because of the limits.
However, with users “user_2” and “user_3” my jobs are either rejected or stay pending, for example:
11168 research. test user_3 PENDING 0:00 2024-04-17T12:53:21 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11173 research. test user_3 PENDING 0:00 2024-04-17T13:06:02 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11174 research. test user_3 PENDING 0:00 2024-04-17T13:06:16 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11176 research. test user_3 PENDING 0:00 2024-04-17T13:07:23 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11180 research. test user_3 PENDING 0:00 2024-04-17T13:08:45 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
For example, “user_3” tries to submit like this (the test.sh script is just a simple “sleep 50”):
sbatch -p aolin.q -N 2 ./test.sh
-> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
sbatch -p aolin.q -N 1 ./test.sh
-> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
sbatch -p research.q -N 1 ./test.sh
-> submitted but not running
-> nodelist(reason)=(AssocMaxCpuPerJobLimit) -> WHY???
sbatch -p research.q -N 1 -n 1 ./test.sh
-> submitted but not running
-> nodelist(reason)=(AssocMaxCpuPerJobLimit) -> WHY???
sbatch -p xeon.q -N 1 -n 1 ./test.sh
-> submitted and running!!
[root@myserver log]# squeue
JOBID PARTITION NAME USER STATE TIME SUBMIT_TIME START_TIME NODE CPUS OVER_S TRES_PER_NODE NODELIST(REASON) DEPENDENCY REQ_NODES NODELIST
11202 research. test user_3 PENDING 0:00 2024-04-17T13:33:31 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11200 research. test user_3 PENDING 0:00 2024-04-17T13:33:17 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11212 xeon.q test user_3 RUNNING 0:18 2024-04-17T13:36:10 2024-04-17T13:36:10 1 1 OK N/A aolin-cpu-1 (null) aolin-cpu-1
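The reason column is cut off in this listing. A narrower squeue format without field widths should show the full reason next to the CPU and node counts of each pending job. Again just a sketch: `show_pending_reasons` and `DRY_RUN` are hypothetical names, and the format string uses the squeue specifiers `%i` (job ID), `%P` (partition), `%C` (CPUs), `%D` (nodes) and `%r` (reason).

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): show a user's pending jobs with
# the untruncated reason plus the CPU and node counts of each job.
show_pending_reasons() {
    user="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: only print the command that would be executed.
        printf '%s\n' "squeue -u $user -t PENDING -o '%i %P %C %D %r'"
    else
        squeue -u "$user" -t PENDING -o '%i %P %C %D %r'
    fi
}
```

For example, `show_pending_reasons user_3` would list jobs 11200 and 11202 while they are still pending.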
Why? What am I doing wrong? Where is the limit that I am not seeing?
Thanks a lot!