[slurm-users] [External] Re: Partition question
Ole Holm Nielsen
Ole.H.Nielsen at fysik.dtu.dk
Thu Dec 19 19:40:43 UTC 2019
Some examples are here:
https://wiki.fysik.dtu.dk/niflheim/Slurm_accounting#quality-of-service-qos
/Ole
On 19-12-2019 19:30, Prentice Bisbal wrote:
>
> On 12/19/19 10:44 AM, Ransom, Geoffrey M. wrote:
>>
>> The simplest is probably to just have a separate partition that will
>> only allow job times of 1 hour or less.
>>
>> This is how our Univa queues used to work, with queues overlapping the
>> same hardware. Univa shows available "slots" to the users, so we had a
>> lot of confused users complaining about all those free slots (actually
>> busy slots in the other queue) while their jobs sat in the queue, and
>> new users confused as to why their jobs were being killed after 4
>> hours. I was able to move the short/long behavior to job classes, use
>> RQSes, and have one queue.
>>
>> Although Slurm doesn't show users unused resources, I am concerned
>> that going back to two queues (partitions) will cause user-interaction
>> and adoption problems.
>>
>> It all depends on what best suits the specific needs.
>>
>> Is there a way to have one partition that holds aside a small
>> percentage of resources for jobs with a runtime under 4 hours, i.e.
>> so that jobs with long runtimes cannot tie up 100% of the resources at
>> one time? Some kind of virtual partition that feeds into two other
>> partitions based on runtime would also work. The goal is that users
>> can continue to submit jobs to one partition, but the scheduler won't
>> let 100% of the compute resources get tied up with multi-week jobs.
>>
> The way to do this is with Quality of Service (QOS) in Slurm. When
> creating a QOS, you can limit the aggregate resources (TRES) that jobs
> under that QOS can use. Create a QOS for the longer-running jobs and
> set its GrpTRES so that the number of CPUs is less than 100% of your
> cluster. Create a QOS for the shorter jobs with a shorter time limit
> (MaxWall).
>
> Once the QOSes are set up, you can instruct your users to specify the
> proper QOS when submitting a job, or edit the job_submit.lua script to
> look at the specified time limit and assign/override the QOS based on
> that.
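The job_submit.lua approach could be sketched like this. The QOS names and the 240-minute cutoff are assumptions; the real script would need to handle jobs submitted without a time limit as well.

```lua
-- Hypothetical job_submit.lua sketch: route jobs to a QOS by time limit.
-- "short" and "long" are assumed QOS names; 240 minutes is an assumed cutoff.
function slurm_job_submit(job_desc, part_list, submit_uid)
    -- job_desc.time_limit is in minutes when the user specified one
    if job_desc.time_limit ~= nil and job_desc.time_limit <= 240 then
        job_desc.qos = "short"
    else
        job_desc.qos = "long"
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```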
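The QOS setup Prentice describes might look roughly like the following. This is only a sketch: the QOS names "short" and "long", the 90%-of-1000-CPUs cap, and the 4-hour wall limit are assumed values you would adapt to your own cluster.

```shell
# Create a "long" QOS whose jobs may together use at most 900 CPUs
# (assuming a 1000-CPU cluster, so long jobs can never tie up 100%):
sacctmgr -i add qos long set GrpTRES=cpu=900

# Create a "short" QOS with a 4-hour wall-time limit:
sacctmgr -i add qos short set MaxWall=04:00:00

# Allow an account's users to select either QOS (account name assumed):
sacctmgr -i modify account mygroup set qos=short,long

# Users then pick the QOS at submit time:
sbatch --qos=short --time=01:00:00 job.sh
```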