[slurm-users] [External] Re: Partition question

Ole Holm Nielsen Ole.H.Nielsen at fysik.dtu.dk
Thu Dec 19 19:40:43 UTC 2019

Some examples are here:


On 19-12-2019 19:30, Prentice Bisbal wrote:
> On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote:
>> The simplest is probably to just have a separate partition that will 
>> only allow job times of 1 hour or less.
>> This is how our Univa queues used to work, by overlapping the same 
>> hardware. Univa shows available “slots” to the users, and we had a lot 
>> of confused users complaining about all those free slots (busy slots 
>> in the other queue) while their jobs sat in the queue, and new users 
>> confused as to why their jobs were being killed after 4 hours. I was 
>> able to move the short/long behavior to job classes, use RQSes, and 
>> end up with one queue.
>> While Slurm isn’t showing users unused resources, I am concerned that 
>> going back to two queues (partitions) will cause user interaction and 
>> adoption problems.
>> It all depends on what best suits the specific needs.
>> Is there a way to have one partition that holds aside a small 
>> percentage of resources for jobs with a runtime under 4 hours, i.e. 
>> jobs with long runtimes cannot tie up 100% of the resources at one 
>> time? Some kind of virtual partition that feeds into two other 
>> partitions based on runtime would also work. The goal is that users 
>> can continue to post jobs to one partition but the scheduler won’t let 
>> 100% of the compute resources get tied up with multi-week-long jobs.
> The way to do this is with a Quality of Service (QOS) in Slurm. When 
> creating a QOS, you can cap the total resources that jobs running under 
> it may use. Create a QOS for the longer-running jobs and set its GrpTRES 
> limit so that the number of CPUs it can occupy at once is less than 100% 
> of your cluster. Create a QOS for the shorter jobs with a shorter time 
> limit (MaxWall).
> Once the QOSes are set up, you can instruct your users to specify the 
> proper QOS when submitting a job, or edit the job_submit.lua script to 
> look at the time limit specified and assign/override the QOS based on 
> that.
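As a rough sketch, the QOS setup described above might look like the following sacctmgr commands. The QOS names (`long`, `short`), the CPU count, and the time limits are made-up examples for an imaginary 1000-core cluster; adjust them to your site, and note this assumes accounting (slurmdbd) is already configured:

```shell
# Hypothetical example: a QOS for long jobs, capped as a group at
# 800 CPUs (80% of an assumed 1000-core cluster), max 14-day walltime.
sacctmgr add qos long
sacctmgr modify qos long set GrpTRES=cpu=800 MaxWall=14-00:00:00

# A QOS for short jobs: no aggregate CPU cap, 4-hour walltime limit,
# so short jobs can always reach the CPUs the long QOS cannot fill.
sacctmgr add qos short
sacctmgr modify qos short set MaxWall=04:00:00

# Grant a user's association access to both QOSes
# ("someuser" is a placeholder).
sacctmgr modify user where name=someuser set QOS=long,short
```

Users would then submit with `sbatch --qos=short ...` or `--qos=long ...`, or a job_submit.lua plugin could pick the QOS automatically from the requested time limit, as suggested above.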
