Hi,
On our cluster we have some jobs that are queued even though there are available nodes to run on. The listed reason is "priority" but that doesn't really make sense to me. Slurm isn't picking another job to run on those nodes; it's just not running anything at all. We do have a quite heterogeneous cluster, but as far as I can tell the queued jobs aren't requesting anything that would preclude them from running on the idle nodes. They are array jobs, if that makes a difference.
Thanks for any help you all can provide.
In theory, if jobs are pending with “Priority”, one or more other jobs will be pending with “Resources”.
So a few questions:
1. What are the “Resources” jobs waiting on, resource-wise?
2. When are they scheduled to start?
3. Can your array jobs backfill into the idle resources and finish before the “Resources” jobs are scheduled to start?
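One way to answer those questions is to ask Slurm directly for each pending job's reason and expected start time (a sketch; `<jobid>` is a placeholder):

```shell
# List pending jobs with their pending reason (%r) and, once the backfill
# scheduler has evaluated them, their expected start time (%S)
squeue -t PD -o "%.10i %.9P %.8u %.20r %S"

# Full detail for one pending job (StartTime, requested TRES, etc.)
scontrol show job <jobid>
```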
From: "Long, Daniel S." via slurm-users <slurm-users@lists.schedmd.com>
Date: Tuesday, September 24, 2024 at 11:47 AM
To: slurm-users@schedmd.com
Subject: [slurm-users] Jobs pending with reason "priority" but nodes are idle
I experimented a bit and think I have figured out the problem, but not the solution.
We use multifactor priority, with the job's account as the primary factor. Right now one project has much higher priority due to a deadline. Those are the jobs that are pending with “Resources”. They cannot run on the idle nodes because those nodes do not satisfy the jobs' resource requirements (they don't have GPUs). What I don't understand is why Slurm doesn't schedule the lower-priority jobs onto those nodes, since those jobs don't require GPUs. It's very unexpected behavior to me. Is there an option somewhere I need to set?
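A sketch of commands to compare what the pending jobs actually request with what the idle nodes offer (assuming GPUs are exposed as GRES; `<jobid>` is a placeholder):

```shell
# Requested resources for a pending job appear in the TRES / TresPerNode fields
scontrol show job <jobid> | grep -iE 'tres'

# GRES available on each idle node (%G), e.g. gpu:a100:4 or (null)
sinfo -N -t idle -o "%.15N %.12P %G"
```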
From: "Renfro, Michael" <Renfro@tntech.edu>
Date: Tuesday, September 24, 2024 at 1:54 PM
To: Daniel Long <Daniel.Long@gtri.gatech.edu>, slurm-users@schedmd.com
Subject: Re: Jobs pending with reason "priority" but nodes are idle
Do you have backfill scheduling [1] enabled? If so, what settings are in place?
And the lower-priority jobs will be eligible for backfill only if they don't delay the start of the higher-priority jobs.
So what kind of resources and time does a given array job require? Odds are, they have a time request that conflicts with the scheduled start time for the high priority jobs.
[1] https://slurm.schedmd.com/sched_config.html#backfill
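For reference, a sketch of what the backfill-related settings can look like in slurm.conf (illustrative values only, not recommendations; the linked page documents each parameter):

```
SchedulerType=sched/backfill
# bf_window: how far ahead (minutes) the backfill planner looks;
# bf_interval: seconds between backfill passes;
# bf_max_job_test: number of jobs considered per pass
SchedulerParameters=bf_window=2880,bf_interval=30,bf_max_job_test=1000,bf_continue
```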
From: "Long, Daniel S." <Daniel.Long@gtri.gatech.edu>
Date: Tuesday, September 24, 2024 at 1:20 PM
To: "Renfro, Michael" <Renfro@tntech.edu>, slurm-users@schedmd.com
Subject: Re: Jobs pending with reason "priority" but nodes are idle
You might need to do some tuning on your backfill loop, since that loop is the one that should backfill those lower-priority jobs in. I would also check whether those lower-priority jobs will actually fit in before the higher-priority jobs are due to run; they may not.
-Paul Edmon-
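To illustrate the point, here is a toy sketch (not Slurm source code, and it ignores time limits and reservations) of why a strict-priority main loop leaves nodes idle until a backfill pass runs: the main loop stops at the first job it cannot place, and only backfill tries the lower-priority jobs behind it.

```python
def schedule(jobs, nodes, backfill=False):
    """jobs: list of dicts sorted by descending priority, each with 'name'
    and 'needs_gpu'. nodes: list of dicts with 'name' and 'has_gpu'.
    Returns a list of (job name, node name) placements."""
    free = list(nodes)
    placed = []

    def try_place(job):
        for node in free:
            if not job["needs_gpu"] or node["has_gpu"]:
                free.remove(node)
                placed.append((job["name"], node["name"]))
                return True
        return False

    blocked = []
    for job in jobs:
        if not try_place(job):
            blocked.append(job)
            break  # strict-priority main loop: stop at the first stuck job
    if backfill:
        # backfill pass: keep trying the remaining, lower-priority jobs
        idx = jobs.index(blocked[0]) + 1 if blocked else len(jobs)
        for job in jobs[idx:]:
            try_place(job)
    return placed

jobs = [
    {"name": "hi-gpu", "needs_gpu": True},   # high priority, needs a GPU
    {"name": "lo-cpu", "needs_gpu": False},  # low priority, CPU only
]
nodes = [{"name": "cpu-node", "has_gpu": False}]

print(schedule(jobs, nodes, backfill=False))  # -> []  (node sits idle)
print(schedule(jobs, nodes, backfill=True))   # -> [('lo-cpu', 'cpu-node')]
```

In real Slurm the backfill pass additionally checks that a candidate job would finish, per its time limit, before the reserved start of the blocked higher-priority job, which is why shorter time limits help jobs backfill.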
The low-priority jobs definitely can't “fit in” before the high-priority jobs would start, but I don't think that should matter. The idle nodes are incapable of running the high-priority jobs, ever. I would expect Slurm to assign those nodes the highest-priority jobs that they are capable of running.
From: Paul Edmon via slurm-users <slurm-users@lists.schedmd.com>
Reply-To: Paul Edmon <pedmon@cfa.harvard.edu>
Date: Tuesday, September 24, 2024 at 2:26 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Jobs pending with reason "priority" but nodes are idle
Since nobody replied after this: if the nodes are incapable of ever running those jobs due to insufficient resources, the default “EnforcePartLimits=No” [1] might be the issue. It can allow a job to stay queued even though it is impossible for it to ever run.
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_EnforcePartLimits
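A sketch of the relevant slurm.conf line (valid values are NO, ANY, and ALL; see the link above for the exact semantics):

```
# Reject jobs at submission if they can never run in (ANY / ALL of) the
# requested partitions, instead of leaving them queued indefinitely
EnforcePartLimits=ANY
```

Note this enforces partition-level limits; whether it catches a given impossible request depends on what the partition actually limits.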
From: "Long, Daniel S." via slurm-users <slurm-users@lists.schedmd.com>
Date: Tuesday, September 24, 2024 at 1:39 PM
To: Paul Edmon <pedmon@cfa.harvard.edu>, slurm-users@lists.schedmd.com
Subject: [slurm-users] Re: Jobs pending with reason "priority" but nodes are idle