Hello,
I'm running some tests with "associations" in "sacctmgr". I have created three users (user_1, user_2 and user_3), and for each of them I have created associations:
[root@myserver log]# sacctmgr show user user_1 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_1 test None q50004 test aolin.q 1 4 2 10 normal
user_1 test None q50004 test cuda-staf+ 1 4 2 10 normal
[root@myserver log]# sacctmgr show user user_2 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_2 test None q50004 test cuda-int.q 1 4 normal
[root@myserver log]# sacctmgr show user user_3 --associations
User Def Acct Admin Cluster Account Partition Share Priority MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
user_3 test None q50004 test research.q 1 2 1 normal
user_3 test None q50004 test xeon.q 1 2 1 normal
All users belong to the "test" account:

[root@myserver log]# sacctmgr show account test --association
Account Descr Org Cluster ParentName User Share Priority GrpJobs GrpNodes GrpCPUs GrpMem GrpSubmit GrpWall GrpCPUMins MaxJobs MaxNodes MaxCPUs MaxSubmit MaxWall MaxCPUMins QOS Def QOS
---------- -------------------- -------------------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- ------- --------- ----------- ----------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
test test test q50004 root 1 normal
test test test q50004 user_1 1 4 2 10 normal
test test test q50004 user_1 1 4 2 10 normal
test test test q50004 user_2 1 4 normal
test test test q50004 user_3 1 2 1 normal
test test test q50004 user_3 1 2 1 normal
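Because the wide sacctmgr tables leave it unclear which column each limit belongs to once the output wraps, a pipe-separated listing may be easier to compare. This is only a minimal sketch: the helper name show_assoc_limits is hypothetical, and I am assuming the format field names match the column headers shown above (on newer Slurm releases the CPU limit may be reported under MaxTRES instead of MaxCPUs).

```shell
# Hypothetical helper: print one user's associations as pipe-separated fields,
# so that empty limits show up as empty fields instead of shifted columns.
# Assumption: the format field names match the column headers printed by
# "sacctmgr show user ... --associations" on this cluster.
show_assoc_limits() {
    if [ -z "$1" ]; then
        echo "usage: show_assoc_limits USER" >&2
        return 2
    fi
    if ! command -v sacctmgr >/dev/null 2>&1; then
        echo "sacctmgr not found in PATH" >&2
        return 127
    fi
    sacctmgr --noheader --parsable2 show associations where user="$1" \
        format=User,Account,Partition,MaxJobs,MaxNodes,MaxCPUs,MaxSubmit
}
```

For example, "show_assoc_limits user_3" should print one line per association of user_3, with the fields in the order given in the format list.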
When I submit with "user_1", all tests run fine: some jobs are queued and executed, and some jobs are rejected because of the limits. However, with users "user_2" and "user_3" almost every job is rejected or left pending, with messages like these:

11168 research. test user_3 PENDING 0:00 2024-04-17T12:53:21 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11173 research. test user_3 PENDING 0:00 2024-04-17T13:06:02 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11174 research. test user_3 PENDING 0:00 2024-04-17T13:06:16 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11176 research. test user_3 PENDING 0:00 2024-04-17T13:07:23 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11180 research. test user_3 PENDING 0:00 2024-04-17T13:08:45 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
For example, user "user_3" tries to submit like this (the test.sh script is just a simple "sleep 50"):

sbatch -p aolin.q -N 2 ./test.sh --> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
sbatch -p aolin.q -N 1 ./test.sh --> sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
sbatch -p research.q -N 1 ./test.sh --> submitted but not running --> nodelist(reason)= (AssocMaxCpuPerJobLimit) --> WHY???
sbatch -p research.q -N 1 -n 1 ./test.sh --> submitted but not running --> nodelist(reason)= (AssocMaxCpuPerJobLimit) --> WHY???
sbatch -p xeon.q -N 1 -n 1 ./test.sh --> submitted and running!!
[root@myserver log]# squeue
JOBID PARTITION NAME USER STATE TIME SUBMIT_TIME START_TIME NODE CPUS OVER_S TRES_PER_NODE NODELIST(REASON) DEPENDENCY REQ_NODES NODELIST
11202 research. test user_3 PENDING 0:00 2024-04-17T13:33:31 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11200 research. test user_3 PENDING 0:00 2024-04-17T13:33:17 N/A 1 1 OK N/A (AssocMaxCpuPerJo (null)
11212 xeon.q test user_3 RUNNING 0:18 2024-04-17T13:36:10 2024-04-17T13:36:10 1 1 OK N/A aolin-cpu-1 (null) aolin-cpu-1
Why? What am I doing wrong? Where is the limit that I am not seeing?
Thanks a lot!