[slurm-users] Spreading jobs across servers instead of loading up individual nodes

Aravindh Sampathkumar aravindh at fastmail.com
Mon Nov 19 07:03:43 MST 2018


Paul, 

Thanks for the response. I tested and activated LLN on a per-partition
basis in our cluster. It works exactly as described and satisfies my
need perfectly. Thanks!
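
In case it helps anyone searching the archives later, here is roughly
what the relevant slurm.conf lines look like. The node names, CPU and
memory figures, and partition name below are placeholders rather than
our real values:

    # CPU and memory as consumable resources. Adding CR_LLN here
    # instead would apply least-loaded-node scheduling to every
    # partition cluster-wide.
    SelectType=select/cons_res
    SelectTypeParameters=CR_CPU_Memory

    # Least-loaded-node scheduling enabled only for this partition
    NodeName=node[1-5] CPUs=16 RealMemory=64000 State=UNKNOWN
    PartitionName=nav Nodes=node[1-5] LLN=YES MaxTime=INFINITE State=UP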
--
  Aravindh Sampathkumar
  aravindh at fastmail.com



On Thu, Nov 15, 2018, at 3:25 PM, Paul Edmon wrote:
> I recommend the LLN option for partitions:


> *LLN* Schedule resources to jobs on the least loaded nodes (based upon
> the number of idle CPUs). This is generally only recommended for an
> environment with serial jobs as idle resources will tend to be highly
> fragmented, resulting in parallel jobs being distributed across many
> nodes. Note that node *Weight* takes precedence over how many idle
> resources are on each node. Also see the *SelectTypeParameters*
> configuration parameter *CR_LLN* to use the least loaded nodes in
> every partition.
> 
> -Paul Edmon-


> On 11/15/2018 4:25 AM, Aravindh Sampathkumar wrote:
>> Hi All.
>> 
>> I'm having some trouble finding the appropriate section of the
>> documentation to change the Slurm resource allocation policy.
>> 
>> We have configured CPU and memory as consumable resources, and our
>> nodes can run multiple jobs as long as CPU and memory are available.
>> 
>> What I want is for Slurm to spread jobs across all available servers
>> in a partition instead of loading up a few servers while others are
>> idling.
>> 
>> For example, I have a partition nav which has 5 compute nodes
>> (node[1-5]) dedicated to it. When users submit 3 jobs to the nav
>> partition, each requesting 1 CPU core and 1 GB of memory, Slurm
>> schedules all the jobs on node1 because it has enough CPU cores and
>> memory to satisfy the job requirements. Nodes 2, 3, 4, and 5 are
>> idle.
>> 
>> What I want instead is for Slurm to schedule job1 to node1, job2 to
>> node2, job3 to node3, and so on, and then, if there are more jobs
>> than there are nodes, to use the rest of the resources available on
>> node1.
>> 
>> Why? 
>> A small group that is using this partition is concerned that all
>> their jobs get scheduled on the same node, where they have to share
>> network bandwidth and bandwidth to local disk. If the jobs were
>> spread out instead, each would get more bandwidth.
>> 
>> Appreciate any advice on how I can make this happen.
>> 
>> Thanks,
>>   Aravindh Sampathkumar
>>   aravindh at fastmail.com
>> 
>> 
