Hi,
is it possible to have Slurm automatically schedule jobs to a fitting partition according to the "-t" time requirement?
e.g. 3 partitions:

PartitionName=standard Nodes=c-[01-10] Default=YES MaxTime=04:00:00  DefaultTime=00:10:00 State=UP OverSubscribe=NO
PartitionName=medium   Nodes=c-[04-08] Default=NO  MaxTime=24:00:00  DefaultTime=04:00:00 State=UP OverSubscribe=NO
PartitionName=long     Nodes=c-[09-10] Default=NO  MaxTime=336:00:00 DefaultTime=24:00:00 State=UP OverSubscribe=NO
So in the standard partition, which is the default, we have all nodes and a max time of 4h; in the medium partition we have 4 nodes with a max time of 24h; and in the long partition we have 2 nodes with a max time of 336h.
I was hoping that if I submit a job with -t 01:00:00 it can run on any node (standard partition), whereas when specifying -t 05:00:00 or -t 48:00:00 the job will run on the nodes of the medium or long partition, respectively.
However, my job will not get scheduled at all when -t is greater than 01:00:00.
i.e.
]$ srun --cpus-per-task 1 -t 01:00:01 --pty bash
srun: Requested partition configuration not available now
srun: job 42095 queued and waiting for resources
It will wait forever because the standard partition is selected; I was thinking that Slurm would automatically switch to the medium partition.
Do I misunderstand something there? Or can this be somehow configured?
Thanks so much, and sorry for the naive question.
Dietmar
Hi Dietmar,
You can specify multiple partitions, e.g.
$ salloc --cpus-per-task=1 --time=01:00:01 --partition=standard,medium,long
Note that, rather than using 'srun ... --pty bash', the preferred method (as far as I understand) is to use 'salloc' as above and then 'srun' for starting MPI processes.
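The same idea works for batch jobs; here is a minimal sketch (the job name and the program are just placeholders):

#!/bin/bash
#SBATCH --job-name=example                 # placeholder name
#SBATCH --cpus-per-task=1
#SBATCH --time=05:00:00                    # too long for 'standard', still fits 'medium' and 'long'
#SBATCH --partition=standard,medium,long   # Slurm starts the job in whichever listed partition can run it earliest

srun ./my_program                          # placeholder executable

As far as I understand, partitions in the list whose MaxTime is shorter than the requested time are simply not used for that job.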
Cheers,
Loris
Hi Loris,
Thanks for the hint. This works nicely, but it would be nice if I did not need to specify the partitions at all. Any thoughts?
Dietmar
Try adding to the config:

EnforcePartLimits=ANY
JobSubmitPlugins=all_partitions
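A minimal slurm.conf sketch, with the effect of each line as I understand it from the slurm.conf man page:

# slurm.conf (fragment)
EnforcePartLimits=ANY            # reject a job at submission only if it satisfies none of its partitions' limits
JobSubmitPlugins=all_partitions  # jobs submitted without --partition get all partitions as their default

With both set, a job submitted with just a time limit should become eligible for every partition whose MaxTime covers it.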
Diego
Hi Diego,
thanks a lot, it seems to work as far as I have been able to test so far.
Dietmar
Hi Dietmar,
I am not aware that you can set multiple partitions as a default.
The question is why you actually need partitions with different maximum runtimes.
In our case, a university cluster with a very wide range of codes and usage patterns, multiple partitions would probably lead to fragmentation and wasted resources, because the job mix does not always fit the various partitions well. I am therefore a member of the "as few partitions as possible" camp, and so our set-up has essentially only one partition with a DefaultTime of 14 days. We do, however, let users set a QOS to gain a priority boost in return for accepting a shorter run-time and a reduced maximum number of cores.
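As a rough sketch of what such a QOS could look like (the QOS name and the limits below are invented for illustration, not our actual values):

# a QOS that trades a shorter run-time and a per-user core cap for higher priority
sacctmgr add qos short_boost
sacctmgr modify qos short_boost set Priority=1000 MaxWall=3-00:00:00 MaxTRESPerUser=cpu=128

# users opt in per job, provided the QOS is in their association:
sbatch --qos=short_boost job.sh

Note that the priority boost only has an effect if PriorityWeightQOS is non-zero in slurm.conf.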
Occasionally people complain about short jobs having to wait in the queue for too long, but I have generally been successful in solving the problem by having them estimate their resource requirements better or bundle their work in order to increase the run-time-to-wait-time ratio.
Cheers,
Loris
Hi Loris,
Diego suggested a possible way which seems to work after a quick test.
We would like to have only a subset of the nodes in a partition for long-running jobs, so that there are enough nodes available for short jobs.
The nodes of the long partition, however, are also part of the short partition, so they can also be utilized when no long jobs are running.
That's our idea...
We didn't look into QOS yet, but this might also be a way to go, thanks.
Dietmar
Hi Dietmar,
I was facing quite similar requirements to yours. We ended up using QoS instead of partitions because this approach provides higher flexibility and more features. The basic distinction between the two approaches is that partitions are node-based while QoS are (essentially) resource-based. So, instead of saying "Long jobs can only run on nodes 9 and 10", you would be able to say "Long jobs can only use X CPU cores in total".
However, yes, your partition-based approach is going to do the job, as long as you do not need any QoS-based preemption.
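For illustration, a 'long' QOS that caps the total cores available to long jobs could be created roughly like this (the QOS name, the core count and the user name are made-up examples):

# allow up to 14 days, but let all jobs with this QOS together use at most 64 cores
sacctmgr add qos long
sacctmgr modify qos long set MaxWall=14-00:00:00 GrpTRES=cpu=64

# give users access to the QOS via their association
sacctmgr modify user someuser set qos+=long

# jobs then request it explicitly
sbatch --qos=long --time=48:00:00 job.sh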
Cheers,
Thomas
Dear Thomas,
the QoS approach seems really helpful, we'll look into it. Perhaps, as a starting point for us, could you translate my simple example into a QoS config/setting?
Thanks so much,
Dietmar
Hi Dietmar,
Yes, I wasn't aware of the approach Diego suggested, but it might be useful for us too.
If you have plenty of short-running jobs, your approach is probably reasonable. On our system, the number of short-running jobs would probably dip significantly over weekends and public holidays, so resources would potentially be blocked for the long-running jobs. On the other hand, long-running jobs on our system often run for days, so one day here or there might not be so significant. And if the long-running jobs were able to start in the short partition, they could block short jobs.
The other thing to think about with regard to short jobs is backfilling. With our mix of jobs, unless a job needs a large amount of memory or a large number of cores, jobs with a run-time of only a few hours should be backfilled fairly efficiently.
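Backfilling needs the backfill scheduler and reasonably honest time limits from the users; a slurm.conf fragment might look roughly like this (the values are only examples, see the slurm.conf man page):

SchedulerType=sched/backfill
# bf_window is in minutes and should cover the longest allowed time limit (here 14 days)
SchedulerParameters=bf_window=20160,bf_continue,bf_max_job_test=1000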
Regards
Loris
Hi Loris,
You are absolutely right, and I guess we will need to optimize using QoS. Thanks for your input and thoughts.
Dietmar