[slurm-users] Distribute jobs in similar nodes in the same partition
Marcus Wagner
wagner at itc.rwth-aachen.de
Tue May 15 06:13:53 MDT 2018
Hi Antonio,
if you don't care which nodes are used, but want to ensure that only
uniform nodes are used, you could also use the topology/tree plugin,
where you define one "switch" for every node type and then use sbatch
--switches=1.
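
For illustration, a minimal sketch of what that could look like, assuming the three node types from Antonio's example. The node and switch names are made up, and the root switch line is an assumption so that jobs can still span node types when needed:

```
# slurm.conf
TopologyPlugin=topology/tree

# topology.conf (hypothetical node and switch names)
SwitchName=typeA Nodes=nodeA[01-10]
SwitchName=typeB Nodes=nodeB[01-20]
SwitchName=typeC Nodes=nodeC[01-15]
SwitchName=root Switches=typeA,typeB,typeC
```

A job submitted with sbatch --switches=1 would then be placed under a single leaf switch, i.e. on one node type. The sbatch man page also documents an optional @max-time suffix for --switches, which bounds how long the job waits for that placement.
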
Best
Marcus
On 05/11/2018 05:49 PM, Antonio Lara wrote:
>
> Thank you all for your answers, I will research some more along these
> lines!
>
> Any other opinion is welcome
>
> Regards,
>
> Antonio
>
>
> On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:
>>
>> I’ll second that – we have a cluster with 4 generations of nodes. We
>> assign a processor type feature to each node and require the users to
>> ask for at least one of those features in their jobs via
>> job_submit.lua – see the code below. For a job that can run on any
>> processor type, you can use this:
>>
>> #SBATCH --constraint=[wes|san|has|bro]
>>
>> See the constraint section of “man sbatch” for more details, but this
>> lets the job run on any processor type while keeping all of its nodes
>> on one type. It works well from a utilization standpoint – jobs
>> will run on the first processor type that is free.
>>
>> local feature_count = 0
>>
>> -- Count how many of the known processor-type features the job requests.
>> if job_desc ~= nil and job_desc.features ~= nil then
>>    if string.match(job_desc.features, "wes") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "san") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "has") then feature_count = feature_count + 1 end
>>    if string.match(job_desc.features, "bro") then feature_count = feature_count + 1 end
>> end
>>
>> if feature_count > 0 then
>>    slurm.log_info("Found %s valid cpu features", feature_count)
>> else
>>    -- Reject the job and tell the user why.
>>    slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
>>    slurm.log_error("Found %s cpu features from %s", feature_count, submit_uid)
>>    -- See slurm/slurm_errno.h and src/common/slurm_errno.c
>>    -- for the list of error codes and messages.
>>    return 2002
>> end
>>
>> Of course, the user can leave off the square brackets and get any mix
>> of processor types. We have some codes that run fine across
>> different processor types so we allow this. Our group is small
>> enough that we can easily educate and police the users to do the
>> right thing. But you could add more logic to job_submit.lua to
>> require the brackets if you wanted to.
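
A minimal sketch of what such extra logic might look like, assuming the policy is simply "the whole feature expression must be wrapped in square brackets". The helper name has_bracketed_features is hypothetical and not part of Slurm, and the pattern does not try to cover every form of constraint expression described in the sbatch man page:

```lua
-- Hypothetical helper (not part of Slurm): returns true when the feature
-- expression is fully wrapped in square brackets, e.g. "[wes|san|has|bro]".
local function has_bracketed_features(features)
  if features == nil then
    return false
  end
  -- Anchored Lua pattern: optional whitespace, "[", one or more characters
  -- that are not brackets, "]", optional whitespace.
  return string.match(features, "^%s*%[[^%[%]]+%]%s*$") ~= nil
end

-- Possible use inside slurm_job_submit(job_desc, part_list, submit_uid),
-- shown as a comment because it only runs inside slurmctld:
--   if not has_bracketed_features(job_desc.features) then
--     slurm.log_user("Please wrap the processor types in [ ].")
--     return slurm.ERROR
--   end
```

With this check, "[wes|san|has|bro]" would be accepted, while "wes|san|has|bro" or a missing constraint would be rejected.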
>>
>> Darby
>>
>> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
>> of Hadrian Djohari <hxd58 at case.edu>
>> *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Date: *Friday, May 11, 2018 at 5:22 AM
>> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
>> *Cc: *"slurm-users at schedmd.com" <slurm-users at schedmd.com>
>> *Subject: *Re: [slurm-users] Distribute jobs in similar nodes in the
>> same partition
>>
>> You can use node features to define the node types in slurm.conf.
>>
>> Then, when submitting the job, use -C <feature name> to use just
>> that node type.
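
As a rough sketch of the slurm.conf side, with hypothetical node and feature names and all other node parameters omitted:

```
NodeName=nodeA[01-10] Feature=typeA
NodeName=nodeB[01-20] Feature=typeB
NodeName=nodeC[01-15] Feature=typeC
```

A job could then request a single type with sbatch -C typeA, or any one type with the bracketed form from Darby's message, e.g. -C "[typeA|typeB|typeC]".
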
>>
>> On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.lara at uam.es
>> <mailto:antonio.lara at uam.es>> wrote:
>>
>> Hello everyone,
>>
>> Hopefully someone can help me with this; I cannot find in the manual
>> whether this is even possible:
>>
>> I'm a system administrator, and the following question is from the
>> administrator point of view, not the user's point of view:
>>
>> I work with a cluster which has a partition containing many nodes.
>> These nodes belong to "different categories". That is, we bought
>> several machines of the same type at once, and we did this several
>> times. So, for example, we have 10 machines of type A, 20 machines of
>> type B and 15 machines of type C. Machines of type A are more powerful
>> than machines of type B, which are more powerful than machines of
>> type C.
>>
>> What I am trying to achieve is that Slurm "forces" parallelized jobs
>> to be allocated on machines of the same type, if possible. That is,
>> there should be some kind of priority which tries to allocate only
>> machines of type A, or only machines of type B, or only of type C, and
>> only distributes jobs among machines of different types when there are
>> not enough nodes of the same type available.
>>
>> Does anyone know if this is possible? The idea behind this is that
>> slower machines do not delay the calculations on faster machines
>> when a job is distributed among them, and all machines work at more
>> or less the same pace.
>>
>> I've been told that it is NOT an option to create different
>> partitions, each containing only one type of machine.
>>
>> Please note that I'm not looking for a way to choose, as a user,
>> which nodes to use for a job; what I need is for Slurm to do that and
>> decide which nodes to use, using similar nodes if available.
>>
>> The closest thing I could find in the manual was consumable
>> resources, but I think this is not what I need; there are several
>> examples, but they don't seem to fit this case.
>>
>> Thank you for your help!
>>
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de