[slurm-users] Distribute jobs in similar nodes in the same partition

Marcus Wagner wagner at itc.rwth-aachen.de
Tue May 15 06:13:53 MDT 2018


Hi Antonio,

if you don't care which nodes are used, but want to ensure that only 
uniform nodes are used, you could also use the topology/tree plugin, 
where you define one "switch" for every node type and then use sbatch 
--switches=1.
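
A minimal sketch of what that could look like, assuming the three node
types from the original question. The node names (nodeA[01-10] and so on)
and the switch names are made up for illustration; the "switches" are only
labels and need not correspond to real hardware. The top-level switch is
there so that a job can still span node types if the single-switch request
cannot be satisfied.

```
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf (hypothetical node and switch names)
SwitchName=typeA Nodes=nodeA[01-10]
SwitchName=typeB Nodes=nodeB[01-20]
SwitchName=typeC Nodes=nodeC[01-15]
SwitchName=root Switches=typeA,typeB,typeC
```

A job submitted with "sbatch --switches=1 job.sh" should then be placed on
nodes of a single type. The option also takes an optional maximum wait time
(--switches=1@<max-time>), after which Slurm may fall back to spreading the
job over several switches.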

Best
Marcus

On 05/11/2018 05:49 PM, Antonio Lara wrote:
>
> Thank you all for your answers, I will research some more along these 
> lines!
>
> Any other opinion is welcome
>
> Regards,
>
> Antonio
>
>
> On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:
>>
>> I’ll second that – we have a cluster with 4 generations of nodes.  We 
>> assign a processor type feature to each node and require the users to 
>> ask for at least one of those features in their jobs via 
>> job_submit.lua – see the code below.  For a job that can run on any 
>> processor type, you can use this:
>>
>> #SBATCH --constraint=[wes|san|has|bro]
>>
>> See the constraint section of “man sbatch” for more details, but this 
>> lets the job run on any processor type while keeping all of its nodes 
>> of one type.  It really works great from a utilization standpoint: jobs 
>> will run on the first processor type that is free.
>>
>> local feature_count = 0
>>
>> if job_desc ~= nil and job_desc.features ~= nil then
>>    if string.match(job_desc.features, "wes") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "san") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "has") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "bro") then feature_count = feature_count + 1 end
>> end
>>
>> if feature_count > 0 then
>>    slurm.log_info("Found %s valid cpu features", feature_count)
>> else
>>    slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
>>    slurm.log_error("Found %s cpu features from %s", feature_count, submit_uid)
>>    -- See slurm/slurm_errno.h and src/common/slurm_errno.c
>>    -- for the list of error codes and messages.
>>    return 2002
>> end
>>
>> Of course, the user can leave off the square brackets and get any mix 
>> of processor types.  We have some codes that run fine across 
>> different processor types so we allow this.  Our group is small 
>> enough that we can easily educate and police the users to do the 
>> right thing.  But you could add more logic to job_submit.lua to 
>> require the brackets if you wanted to.
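
A rough sketch of that extra check, in case it is useful. The helper name
is_bracketed is made up, and it assumes job_desc.features holds the
constraint string exactly as the user typed it, brackets included; that is
worth verifying on your Slurm version before relying on it.

```lua
-- Hypothetical helper (not part of Slurm): true only when the whole
-- constraint string is a single bracketed list such as "[wes|san|has|bro]".
local function is_bracketed(features)
   if features == nil then
      return false
   end
   -- Anchored Lua pattern: "[", one or more non-bracket characters, "]".
   return string.match(features, "^%[[^%[%]]+%]$") ~= nil
end

print(is_bracketed("[wes|san|has|bro]"))  -- bracketed: one type for all nodes
print(is_bracketed("wes|san|has|bro"))    -- no brackets: types may be mixed
```

Inside job_submit.lua one could call this on job_desc.features and, when it
returns false, reject the job with the same slurm.log_user / return pattern
used in the quoted code.
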
>>
>> Darby
>>
>> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
>> of Hadrian Djohari <hxd58 at case.edu>
>> *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Date: *Friday, May 11, 2018 at 5:22 AM
>> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Cc: *"slurm-users at schedmd.com" <slurm-users at schedmd.com>
>> *Subject: *Re: [slurm-users] Distribute jobs in similar nodes in the 
>> same partition
>>
>> You can use node features to define the node types in slurm.conf.
>>
>> Then, when submitting the job, use -C <feature name> to use just 
>> that node type.
>>
>> On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.lara at uam.es> wrote:
>>
>>     Hello everyone,
>>
>>     Hopefully someone can help me with this; I cannot find in the
>>     manual whether this is even possible:
>>
>>     I'm a system administrator, and the following question is from the
>>     administrator's point of view, not the user's point of view:
>>
>>     I work with a cluster which has a partition containing many nodes.
>>     These nodes belong to "different categories". That is, we bought
>>     several machines of the same type at once, and we did this several
>>     times. So, for example, we have 10 machines of type A, 20 machines
>>     of type B and 15 machines of type C. Machines of type A are more
>>     powerful than machines of type B, which are more powerful than
>>     machines of type C.
>>
>>     What I am trying to achieve is that Slurm "forces" parallelized
>>     jobs to be allocated on machines of the same type, if possible.
>>     That is, that there is some type of priority which tries to
>>     allocate only machines of type A, or only machines of type B, or
>>     only of type C, and only distributes jobs among machines of
>>     different types when there are not enough nodes of the same type
>>     available.
>>
>>     Does anyone know if this is possible? The idea behind this is that
>>     slower machines do not delay the calculations on faster machines
>>     when a job is distributed among them, and all machines work at
>>     more or less the same pace.
>>
>>     I've been told that it is NOT an option to create different
>>     partitions, each containing only one type of machine.
>>
>>     Please note that I'm not looking for a way for a user to choose
>>     which nodes to use for a job. What I need is for Slurm to do that
>>     and decide which nodes to use, using similar nodes if available.
>>
>>     The closest that I could find in the manual was using consumable
>>     resources, but I think this is not what I need. There are several
>>     examples, but they don't seem to fit this case.
>>
>>     Thank you for your help!
>>
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
