[slurm-users] Limit number of jobs on shared nodes?
pedmon at cfa.harvard.edu
Fri May 4 08:05:01 MDT 2018
You might try using Partition QoS's; they support a bunch of neat features.
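
A minimal sketch of the partition QoS idea, assuming accounting through slurmdbd is already set up. The QoS names analysis_part and bio_part are hypothetical, and GrpJobs=24 is simply 8 jobs times the 3 nodes. Note that GrpJobs caps running jobs across the whole partition rather than per node, so this only approximates the per-node limit asked about. The QoS's would be created first with sacctmgr (e.g. "sacctmgr add qos analysis_part", then "sacctmgr modify qos where name=analysis_part set GrpJobs=24", and the same for bio_part), and then attached in slurm.conf:

```
# Sketch only: QoS names are hypothetical, limits are illustrative.
# QoS limits are only enforced when accounting enforcement is enabled.
AccountingStorageEnforce=limits,qos

# Partition definitions from the original post, each with a partition QoS attached.
PartitionName=analysis Nodes=n[144-146] MaxTime=4-0:0 MaxNodes=1 State=UP AllowGroups=all Priority=100 OverSubscribe=FORCE:4 Hidden=NO QOS=analysis_part
PartitionName=bio Nodes=n[144-146] MaxTime=14-0:0 MaxNodes=1 State=UP AllowGroups=all Priority=100 OverSubscribe=FORCE:4 Hidden=NO Default=NO QOS=bio_part
```

After editing slurm.conf, "scontrol reconfigure" should pick up the new partition settings.
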
On 05/04/2018 09:59 AM, Liam Forbes wrote:
> We have three "big memory" nodes. We'd like to limit the number of
> jobs that run per node in two partitions that share these nodes. Jobs
> in these two partitions are limited to a single node max. We'd like
> at most 8 jobs from each partition to run per node, so no more than
> 16 jobs should be allowed to share a given node.
> Currently, we have
> in our slurm.conf
> The nodes are defined as:
> NodeName=n[144-146] NodeAddr=10.50.50.[144-146] CPUs=56 Sockets=2
> CoresPerSocket=14 ThreadsPerCore=2 RealMemory=1500000 State=UNKNOWN
> The two partitions are defined as:
> PartitionName=analysis Nodes=n[144-146] MaxTime=4-0:0 MaxNodes=1
> State=UP AllowGroups=all Priority=100 OverSubscribe=FORCE:4 Hidden=NO
> PartitionName=bio Nodes=n[144-146] MaxTime=14-0:0 MaxNodes=1 State=UP
> AllowGroups=all Priority=100 OverSubscribe=FORCE:4 Hidden=NO Default=NO
> We discovered the hard way that this means users can run 4 jobs on each
> of the 56 CPUs/threads on each node. Oops! Not what we intended.
> All our other compute nodes are defined as exclusive, and we don't
> allow multiple jobs to run on them.
> Any recommendations how to implement the 8 jobs per partition per node
> limit we'd like? Should we switch our SelectTypeParameters to
> CR_Socket or CR_Socket_Memory, for example?
> -There are uncountably more irrational fears than rational ones. -P. Dolan
> Liam Forbes loforbes at alaska.edu ph: 907-450-8618 fax: 907-450-8601
> UAF Research Computing Systems Senior HPC Engineer CISSP