[slurm-users] Partition question

Renfro, Michael Renfro at tntech.edu
Thu Dec 19 16:22:55 UTC 2019


My current batch queues have a 30-day limit, and I’ll likely be reducing that to maybe 7 days for most users in the near future, since that makes the priority and fairshare mechanisms more responsive (even if a high-priority job gets bumped to the top of the queue, it may still wait a few days for enough resources to free up, because some of my users request very long run times for very small amounts of resources).
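
Concretely, that change is just a lower MaxTime on the batch partition line shown further down; a sketch of the intended result (the rest of the line stays exactly as it is now):

=====

PartitionName=batch Default=YES MinNodes=1 MaxNodes=40 DefaultTime=1-00:00:00 MaxTime=7-00:00:00 [... other options unchanged ...] State=UP Nodes=node[001-040]

=====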

We end up balancing priority against resource consumption (resources multiplied by time): debug queues get very short time limits and high priority, interactive queues get relatively short time limits and medium priority, and batch queues get long time limits and low priority.

Since we also have some GPU nodes that won’t use all their cores for normal GPU jobs, we also allow jobs with lower CPU counts to use those “excess” CPU cores in the GPU nodes.

Relevant slurm.conf lines:

=====

# cgroup bulletproofing
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Multifactor prioritization
PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightPartition=10000
PriorityWeightJobSize=1000
PriorityMaxAge=1-0

# Scheduling
SchedulerType=sched/backfill
SchedulerParameters=bf_window=43200,bf_resolution=2160,bf_max_job_user=80,bf_continue,default_queue_depth=200

# Nodes (weights adapted from https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#node-weight)
NodeName=gpunode[001-003] CoresPerSocket=14 RealMemory=382000 Sockets=2 ThreadsPerCore=1 Weight=10011 Gres=gpu:2
NodeName=gpunode004 CoresPerSocket=14 RealMemory=894000 Sockets=2 ThreadsPerCore=1 Weight=10021 Gres=gpu:2
NodeName=node[001-022] CoresPerSocket=14 RealMemory=62000 Sockets=2 ThreadsPerCore=1 Weight=10201
NodeName=node[023-034] CoresPerSocket=14 RealMemory=126000 Sockets=2 ThreadsPerCore=1 Weight=10211
NodeName=node[035-040] CoresPerSocket=14 RealMemory=254000 Sockets=2 ThreadsPerCore=1 Weight=10221

# Partitions
PartitionName=interactive Default=NO MinNodes=1 MaxNodes=4 DefaultTime=02:00:00 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=node[001-040]
PartitionName=any-interactive Default=NO MinNodes=1 MaxNodes=4 DefaultTime=02:00:00 MaxTime=02:00:00 AllowGroups=ALL PriorityJobFactor=3 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=12 ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=debug Default=NO MinNodes=1 MaxTime=00:30:00 AllowGroups=ALL PriorityJobFactor=2 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=node[001-040]
PartitionName=gpu Default=NO MinNodes=1 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO MaxCPUsPerNode=16 QoS=gpu ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=gpunode[001-004]
PartitionName=batch Default=YES MinNodes=1 MaxNodes=40 DefaultTime=1-00:00:00 MaxTime=30-00:00:00 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO DefMemPerCPU=2000 AllowAccounts=ALL AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=NO OverTimeLimit=0 State=UP Nodes=node[001-040]

=====

So the critical pieces there are:

- backfill scheduling with appropriate parameters for windows, resolutions, and job counts, allowing some short-run jobs to fit themselves in ahead of other jobs waiting for lots of resources
- multiple partitions with overlapping resources, but with different priorities (PriorityJobFactor) and resource limits (any-interactive covers all 44 nodes but only up to 12 cores/node; interactive covers the 40 non-GPU nodes but allows all 28 cores/node)

I also have a job_submit.lua that allows me to tell users, “you only need to know about three partitions: interactive, debug, and batch”, while routing their jobs to the gpu or any-* partitions if they request a GPU or only a small number of cores. I don’t hide those other partitions from view, but users really never notice them.
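
Roughly, it does the kind of routing sketched below (this is not my production script, just an illustration, and the job_desc field names such as gres vs. tres_per_node and min_cpus vary a bit between Slurm versions, so check the example job_submit.lua that ships with the Slurm source before borrowing anything):

=====

-- job_submit.lua sketch: route jobs to the gpu or any-* partitions so users
-- only ever have to name interactive, debug, or batch themselves.

function slurm_job_submit(job_desc, part_list, submit_uid)
   -- Depending on the Slurm version, a GPU request shows up in job_desc.gres
   -- or job_desc.tres_per_node; this sketch assumes the former.
   local gres = job_desc.gres
   if gres ~= nil and string.find(gres, "gpu") then
      job_desc.partition = "gpu"
      return slurm.SUCCESS
   end

   -- Small interactive jobs can also use the spare cores on the GPU nodes.
   -- Note: min_cpus may be unset (Slurm's NO_VAL) if the user didn't request
   -- a specific count, in which case the job just stays where it was.
   if job_desc.partition == "interactive" and job_desc.min_cpus ~= nil and
      job_desc.min_cpus <= 12 then
      job_desc.partition = "any-interactive"
   end

   return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
   return slurm.SUCCESS
end

=====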

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601     / Tennessee Tech University

> On Dec 19, 2019, at 9:44 AM, Ransom, Geoffrey M. <Geoffrey.Ransom at jhuapl.edu> wrote:
> 
>          The simplest is probably to just have a separate partition that will only allow job times of 1 hour or less.
> 
> This is how our Univa queues used to work, overlapping the same hardware. Univa shows available “slots” to the users, and we had a lot of confused users complaining about all those free slots (which were really busy slots in the other queue) while their jobs sat in the queue, plus new users confused about why their jobs were being killed after 4 hours. I was able to move the short/long behavior to job classes, use RQSes, and end up with one queue.
> 
> While slurm isn’t showing users unused resources, I am concerned that going back to two queues (partitions) will cause user interaction and adoption problems.
> 
>          It all depends on what best suits the specific needs.
> 
> Is there a way to have one partition that holds aside a small percentage of resources for jobs with a runtime under 4 hours, i.e., so that jobs with long runtimes cannot tie up 100% of the resources at one time? Some kind of virtual partition that feeds into two other partitions based on runtime would also work. The goal is that users can continue to submit jobs to one partition, but the scheduler won’t let 100% of the compute resources get tied up with multi-week jobs.
> 
> Thanks.
> 
> On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote:
>  
> Hello
>    I am looking into switching from Univa (sge) to slurm and am figuring out how to implement some of our usage policy in slurm.
>  
> We have a Univa queue which uses job classes and RQSes to limit jobs with a run time over 4 hours to only half the available slots (CPU cores), so some slots are always free for quick jobs. We don’t want all of our resources tied up with multi-week jobs when someone has a batch of 1-hour jobs to run.
>  
> Is there a way to implement this in slurm, i.e., to have a partition which holds some CPU/GPU resources aside for jobs with a short runtime?
>  
> What would be the preferred solution for this issue in a slurm world?


