Something odd is going on on our cluster. User has a lot of pending jobs in a job array (a few thousand).
squeue -u kmnx005 -r -t PD | head -5
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3045324_875 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
3045324_876 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
3045324_877 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
3045324_878 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
None are getting scheduled. But when I ask SLURM what that job’s priority is, it produces no output:
$ sprio -j 3045324
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS TRES
Any clues what’s going on here?

--
Tim Cutts
Scientific Computing Platform Lead
AstraZeneca
Find out more about R&D IT Data, Analytics & AI and how we can support you by visiting our Service Catalogue: https://azcollaboration.sharepoint.com/sites/CMU993
________________________________
AstraZeneca UK Limited is a company incorporated in England and Wales with registered number 03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA.
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at https://www.astrazeneca.com
Hi Tim,
On 10/7/24 11:13, Cutts, Tim via slurm-users wrote:
> Something odd is going on on our cluster. User has a lot of pending jobs in a job array (a few thousand).
>
> squeue -u kmnx005 -r -t PD | head -5
> JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
> 3045324_875 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
> 3045324_876 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
> 3045324_877 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
> 3045324_878 core run_scp_ kmnx005 PD 0:00 1 (JobArrayTaskLimit)
>
> None are getting scheduled. But when I ask SLURM what that job’s priority is, it produces no output:
>
> $ sprio -j 3045324
> JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS TRES
>
> Any clues what’s going on here?
What array limits do you have in slurm.conf? For example:
$ scontrol show config | grep -i array
MaxArraySize = 1001
/Ole
I should be clear, the JobArrayTaskLimit isn’t the issue (the user’s submitted with %1, which is why we’re getting that). What I don’t understand is why the jobs remaining in the queue have no priority at all associated with them. It’s as though the scheduler has forgotten the job array exists altogether.
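For readers following the thread: a %N suffix on --array throttles how many tasks of the array may run concurrently, which is exactly what produces the JobArrayTaskLimit pending reason above. A minimal sketch of such a submission (the script name here is a placeholder, not the user's actual script):

```shell
# Hypothetical submission sketch: the "%1" throttle allows only one
# array task to run at a time, so every other task sits pending with
# reason (JobArrayTaskLimit) -- by design, not as an error.
sbatch --array=0-999%1 run_scp.sh
```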
Tim
From: Ole Holm Nielsen via slurm-users <slurm-users@lists.schedmd.com>
Date: Monday, 7 October 2024 at 10:35 AM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Jobs not getting scheduled, no priority calculation, but still in queue?
On 10/7/24 12:28, Cutts, Tim wrote:
> I should be clear, the JobArrayTaskLimit isn’t the issue (the user’s submitted with %1, which is why we’re getting that). What I don’t understand is why the jobs remaining in the queue have no priority at all associated with them. It’s as though the scheduler has forgotten the job array exists altogether.
I see what you mean. The squeue command can print job priority and many other fields, as defined under the "-O" option. I set this variable to personalize my desired columns:
export SQUEUE_FORMAT2="JobID:8,Partition:15,QOS:7,Name:10,UserName:9,Account:11,State:8,PriorityLong:9,ReasonList:16,TimeUsed:12,SubmitTime:19,TimeLimit:10,tres-alloc:"
You can also use "scontrol show job <jobid>".
In https://slurm.schedmd.com/job_array.html you can see that sprio doesn't handle job arrays yet:
The following Slurm commands do not currently recognize job arrays and their use requires the use of Slurm job IDs, which are unique for each array element: sbcast, sprio, sreport, sshare and sstat. The sacct, sattach and strigger commands have been modified to permit specification of either job IDs or job array elements. The sview command has been modified to permit display of a job's ArrayJobId and ArrayTaskId fields. Both fields are displayed with a value of "N/A" if the job is not part of a job array.
Also, there are a couple of hints in this Wiki page: https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_operations/#job-arrays
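Putting the advice above together, a couple of sketched workarounds for seeing per-task priorities when sprio stays silent on an array (job IDs taken from the thread; these obviously only run on a cluster where that job exists):

```shell
# squeue does understand job arrays, and its -O/--Format option can
# print a Priority column for each pending array task:
squeue -j 3045324 -r -t PD -O JobID,Priority,Reason | head -5

# Alternatively, inspect a single array element directly; scontrol
# reports the computed Priority= value for that task:
scontrol show job 3045324_875 | grep -o 'Priority=[0-9]*'
```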
/Ole