Hi, I have a 20-node cluster and I want to run a job array on it, but I want each node to get exactly one job.
When I do the following:
#!/bin/bash
#SBATCH --job-name=process_images_train  # Job name
#SBATCH --time=50:00:00                   # Time limit hrs:min:sec
#SBATCH --tasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=50000
#SBATCH --array=0-19                      # Job array with 20 jobs (0 to 19)
I get 10 jobs on node #1 and 10 jobs on node #20; I want one job on each node.
I've tried:
#SBATCH --exclusive=user
Also:
#SBATCH --spread-job
#SBATCH --distribution=cyclic
Nothing changes: node #1 got 10 jobs and node #2 got 10 jobs.
Thanks
I’ll start with the question of “why spread the jobs out more than required?” and move on to why the other items didn’t work:
1. exclusive only ensures that others' jobs don't run on a node with your jobs; it does nothing about other jobs you own.
2. spread-job distributes the work of one job across multiple nodes, but does nothing about multiple jobs.
3. distribution also distributes the work of one job.
You might get something similar to what you want by changing the scheduler to use CR_LLN instead of CR_Core_Memory (or whatever you’re using), but that’ll potentially have serious side effects for others’ jobs.
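For reference, a rough sketch of how that could look in slurm.conf (the exact combination depends on what the cluster already uses; the partition and node names below are placeholders, and LLN can also be enabled per-partition rather than cluster-wide):
# slurm.conf sketch: CR_LLN added alongside the existing consumable-resource option
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_LLN
# or limit least-loaded-node placement to one (hypothetical) partition
PartitionName=netheavy Nodes=node[01-20] LLN=YES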
So back to the original question: why *not* pack 20 jobs onto fewer nodes if those nodes have the capacity to run the full set of jobs? You shouldn’t have a constraint with memory or CPUs. Are you trying to spread out an I/O load somehow? Networking?
Thank you Michael, yeah, you guessed right: networking. My job is mostly I/O (network) intensive. My nodes connect to the network via a non-blocking switch, but the Ethernet cards are not the best. So I don't need many CPUs per node, but I do want to run on all nodes to fully utilize the network connection that each node has.
Assuming I don't want to change the scheduler, is there anything else I can do? Thanks, Oren
I’ve never done this myself, but others probably have. At the end of [1], there’s an example of making a generic resource for bandwidth. You could set that to any convenient units (bytes/second or bits/second, most likely), and assign your nodes a certain amount. Then any network-intensive job could reserve all the node’s bandwidth, without locking other less-intensive jobs off the node. It’s identical to reserving 1 or more GPUs per node, just without any hardware permissions.
[1] https://slurm.schedmd.com/gres.conf.html#SECTION_EXAMPLES
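A rough sketch of that approach, following the pattern in [1] (the node names, the 10G figure, and the GRES name are placeholders for illustration, not tested configuration):
# slurm.conf: declare the GRES type, then give each node its full NIC capacity
# (alongside the node's existing CPU/memory settings)
GresTypes=bandwidth
NodeName=node[01-20] Gres=bandwidth:10G
# gres.conf on each node: a counted resource with no device files behind it
Name=bandwidth Count=10G Flags=CountOnly
# in the array job script: each task reserves the full 10G, so at most one such
# task runs per node, while smaller jobs can still share the node's CPUs/memory
#SBATCH --gres=bandwidth:10G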
Thanks, nice workaround. It would be great if there were a way to actually set it so that I get only one job per node, a bit like --exclusive. Thanks
As Thomas had mentioned earlier in the thread, there is --exclusive with no extra additions. But that’d prevent *every* other job from running on that node, which unless this is a cluster for you and you alone, sounds like wasting 90% of the resources. I’d be most perturbed at a user doing that here without some astoundingly good reasons.
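For completeness, that would just be the bare flag in the array script (illustration only; as noted above, it keeps everyone else's jobs off those nodes as well):
#SBATCH --exclusive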
Thanks, but yeah, I do not want to use --exclusive; I just want it to be exclusive for me. Thanks