Hi, we're testing possible slurm configurations on a test system right now. Eventually, it is going to serve ~1000 users.
We're going to have some users who run lots of short jobs (a couple of minutes to ~4h) and others whose jobs run for days or weeks. I want to avoid a situation in which a group of users basically saturates the whole cluster with jobs that run for a week or two and nobody can run any short jobs anymore. I would also like to favor short jobs, because they make the whole cluster feel more dynamic and agile for everybody.
On the other hand, I would like to make the most of the resources, i.e. when nobody is submitting short jobs, long jobs should be able to run on all the nodes.
My idea was to basically have three partitions:
1. PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] PriorityTier=100
2. PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] PriorityTier=100
3. PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=node[01-99] PriorityTier=40 PreemptMode=requeue
and then use the JobSubmitPlugins option "all_partitions" so that all jobs get submitted to all partitions by default. This way, a short job ends up in the `short` partition and can use all nodes. A long job ends up in the `long_safe` partition as long as there is room on the first 50 nodes; these jobs are never preempted. The remaining long jobs go to the `long_preempt` partition, so they run on the remaining nodes as long as there are no higher-priority short (or long) jobs in the queue.
So the cluster could be saturated with long-running jobs, but if short jobs are submitted and the user has a high enough fair-share priority, some of the long jobs would be preempted and the short ones would run.
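For reference, the supporting slurm.conf settings for this scheme would look roughly like the following (an untested sketch; the priority weights are just placeholders):

  # Multifactor priority so fair share and age are taken into account
  PriorityType=priority/multifactor
  PriorityWeightFairshare=10000
  PriorityWeightAge=1000

  # Preempt jobs in partitions with a lower PriorityTier; preempted jobs get requeued
  PreemptType=preempt/partition_prio
  PreemptMode=REQUEUE

  # Submit every job to all partitions unless the user explicitly picks one
  JobSubmitPlugins=all_partitions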
This scenario works fine... BUT the long jobs seem to play ping-pong on the `long_preempt` partition, because as soon as they run they stop accruing AGE priority, unlike jobs that are still queued. As soon as a queued job, even one from the same user, "overtakes" a running one, it preempts the running job, then stops accruing age itself once it runs, and so on...
So, is there maybe a cleverer way to do this?
Thanks a lot! Thomas
Hi Thomas,
I have never really understood the approach of having different partitions for different lengths of job, but it seems to be quite widespread, so I assume there are valid use cases.
However, for our roughly 450 users, of whom about 200 submit at least one job in a given month, we have an alternative approach without pre-emption where we essentially have just a single partition. Users can then specify a QOS which increases priority at the cost of accepting a lower cap on the number of jobs, resources, and maximum runtime:
$ sqos
      Name   Priority     MaxWall MaxJobs MaxSubmit            MaxTRESPU
---------- ---------- ----------- ------- --------- --------------------
    hiprio     100000    03:00:00      50       100   cpu=128,gres/gpu=4
      prio       1000  3-00:00:00     500      1000   cpu=256,gres/gpu=8
  standard          0 14-00:00:00    2000     10000  cpu=768,gres/gpu=16
where
alias sqos='sacctmgr show qos format=name,priority,maxwall,maxjobs,maxsubmitjobs,maxtrespu%20'
The cap on resources for the standard QOS corresponds to about 1/7 of our cores.
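In case it is useful: the QOSs were set up with sacctmgr roughly as follows (from memory, so double-check the option names; AccountingStorageEnforce also has to include limits,qos for the caps to be enforced):

  # Create the QOSs and set the per-user caps (values as in the table above)
  sacctmgr add qos hiprio
  sacctmgr modify qos hiprio set Priority=100000 MaxWall=03:00:00 \
      MaxJobsPerUser=50 MaxSubmitJobsPerUser=100 MaxTRESPerUser=cpu=128,gres/gpu=4
  sacctmgr add qos prio
  sacctmgr modify qos prio set Priority=1000 MaxWall=3-00:00:00 \
      MaxJobsPerUser=500 MaxSubmitJobsPerUser=1000 MaxTRESPerUser=cpu=256,gres/gpu=8
  sacctmgr add qos standard
  sacctmgr modify qos standard set Priority=0 MaxWall=14-00:00:00 \
      MaxJobsPerUser=2000 MaxSubmitJobsPerUser=10000 MaxTRESPerUser=cpu=768,gres/gpu=16

  # Users then select a QOS at submission time, e.g.
  sbatch --qos=hiprio job.sh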
The downside is that very occasionally nodes may idle because a user has reached his or her cap. However, we usually have enough users below their caps submitting jobs, so in practice this happens only rarely, for example around Christmas or New Year.
Cheers,
Loris
Hi Loris, Thomas,
I'm also new to using the Slurm scheduler.
In your configuration, you have to define a DefaultQOS for each user or association, right? You don't define a DefaultQOS at the partition level...
Thanks!
thomas.hartmann--- via slurm-users wrote:
My idea was to basically have three partitions:
- PartitionName=short MaxTime=04:00:00 State=UP Nodes=node[01-99] PriorityTier=100
- PartitionName=long_safe MaxTime=14-00:00:00 State=UP Nodes=node[01-50] PriorityTier=100
- PartitionName=long_preempt MaxTime=14-00:00:00 State=UP Nodes=node[01-99] PriorityTier=40 PreemptMode=requeue
I don't know why you would consider preemption if the jobs are short; just wait for them to finish.
My first approach would be to have two partitions, both containing all nodes, but with different QOSs assigned to them, so you can limit the short jobs to a certain number of CPUs and also limit the long jobs to a certain number of CPUs - maybe 80% for each of them.
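Roughly something like this, with a partition QOS carrying a GrpTRES cap (the QOS names and core counts are only illustrative, assuming about 4000 cores in total, so 80% is roughly 3200):

  # QOSs that cap the total CPUs in use per partition (example numbers)
  sacctmgr add qos short_limit
  sacctmgr modify qos short_limit set GrpTRES=cpu=3200
  sacctmgr add qos long_limit
  sacctmgr modify qos long_limit set GrpTRES=cpu=3200

  # slurm.conf: both partitions contain all nodes, each with its own partition QOS
  PartitionName=short MaxTime=04:00:00    State=UP Nodes=node[01-99] QOS=short_limit
  PartitionName=long  MaxTime=14-00:00:00 State=UP Nodes=node[01-99] QOS=long_limit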
Gerhard
Hi, I'm currently testing an approach similar to the example by Loris.
Why consider preemption? Because, in my original example, if the cluster is saturated with long-running jobs (running for up to two weeks), it should still be possible to run short jobs right away.
Best, Thomas