[slurm-users] Distribute jobs in similar nodes in the same partition

Antonio Lara antonio.lara at uam.es
Fri May 11 09:49:52 MDT 2018


Thank you all for your answers; I will research some more along these lines!

Any other opinions are welcome.

Regards,

Antonio


On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:
>
> I’ll second that – we have a cluster with 4 generations of nodes.  We 
> assign a processor type feature to each node and require the users to 
> ask for at least one of those features in their jobs via 
> job_submit.lua – see the code below.  For a job that can run on any 
> processor type, you can use this:
>
> #SBATCH --constraint=[wes|san|has|bro]
>
> See the constraint section of “man sbatch” for more details, but this 
> lets the job run on any processor type while requiring that all of its 
> nodes be of one type.  It works really well from a utilization 
> standpoint – jobs will run on the first processor type that is free.
>
>    local feature_count = 0
>
>    if job_desc ~= nil and job_desc.features ~= nil then
>       if string.match(job_desc.features, "wes") then feature_count=feature_count+1 end
>       if string.match(job_desc.features, "san") then feature_count=feature_count+1 end
>       if string.match(job_desc.features, "has") then feature_count=feature_count+1 end
>       if string.match(job_desc.features, "bro") then feature_count=feature_count+1 end
>    end
>
>    if feature_count > 0 then
>       slurm.log_info("Found %s valid cpu features",feature_count)
>    else
>       slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
>       slurm.log_error("Found %s cpu features from %s",feature_count,submit_uid)
>       -- See slurm/slurm_errno.h and src/common/slurm_errno.c
>       -- for the list of error codes and messages.
>       return 2002
>    end
>
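
The substring tests above would also count a feature whose name merely contains one of the tags (a hypothetical "hash" would match "has"). A minimal sketch of a stricter count, assuming feature names consist only of letters, digits and underscores; the helper count_cpu_features and the VALID table are hypothetical names, not part of Slurm:

```lua
-- Hypothetical helper, not part of Slurm: count the whole feature names
-- in a constraint expression that belong to the site's CPU tags.
local VALID = { wes = true, san = true, has = true, bro = true }

local function count_cpu_features(features)
   local count = 0
   if features == nil then return count end
   -- Assumption: a feature name is a run of letters, digits or '_';
   -- everything else (| & [ ] * ,) is treated as a separator.
   for name in string.gmatch(features, "[%w_]+") do
      if VALID[name] then count = count + 1 end
   end
   return count
end
```

Inside the submit hook this could stand in for the four string.match lines, for example as feature_count = count_cpu_features(job_desc.features).
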
> Of course, the user can leave off the square brackets and get any mix 
> of processor types.  We have some codes that run fine across different 
> processor types so we allow this.  Our group is small enough that we 
> can easily educate and police the users to do the right thing.  But 
> you could add more logic to job_submit.lua to require the brackets if 
> you wanted to.
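
The extra logic for requiring the brackets might look roughly like the following. This is only a sketch under stated assumptions: the helper name is_single_type_constraint and the VALID table are hypothetical, and it accepts only the plain [a|b|...] form, so a constraint with node counts or additional features combined with "&" would be rejected.

```lua
-- Hypothetical helper, not part of Slurm: true only when the whole
-- constraint has the form [a|b|...] and every alternative is a known CPU tag.
local VALID = { wes = true, san = true, has = true, bro = true }

local function is_single_type_constraint(features)
   if features == nil then return false end
   -- Capture the text between one leading '[' and one trailing ']'.
   local inner = string.match(features, "^%[([^%[%]]+)%]$")
   if inner == nil then return false end
   -- Walk the '|'-separated alternatives; an empty or unknown name fails.
   for name in string.gmatch(inner .. "|", "([^|]*)|") do
      if not VALID[name] then return false end
   end
   return true
end
```

In job_submit.lua the check could be called on job_desc.features inside slurm_job_submit, rejecting the job in the same way as the existing else branch when it returns false.
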
>
> Darby
>
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf 
> of Hadrian Djohari <hxd58 at case.edu>
> *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Date: *Friday, May 11, 2018 at 5:22 AM
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Cc: *"slurm-users at schedmd.com" <slurm-users at schedmd.com>
> *Subject: *Re: [slurm-users] Distribute jobs in similar nodes in the 
> same partition
>
> You can use node features to define the node types in slurm.conf.
>
> Then, when submitting the job, use -C <feature name> to use only that 
> node type.
>
> On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.lara at uam.es> wrote:
>
>     Hello everyone,
>
>     Hopefully someone can help me with this; I cannot find in the
>     manual whether this is even possible:
>
>     I'm a system administrator, and the following question is from the
>     administrator point of view, not the user's point of view:
>
>     I work with a cluster which has a partition containing many nodes.
>     These nodes belong to "different categories". That is, we bought
>     several machines of the same type at once, and we did this several
>     times. So, for example, we have 10 machines of type A, 20 machines of
>     type B and 15 machines of type C. Machines of type A are more
>     powerful than machines of type B, which are more powerful than
>     machines of type C.
>
>     What I am trying to achieve is that Slurm "forces" parallelized
>     jobs to be allocated on machines of the same type, if possible. That
>     is, some kind of priority that tries to allocate only machines of
>     type A, or only machines of type B, or only machines of type C, and
>     distributes a job among machines of different types only when there
>     are not enough nodes of the same type available.
>
>     Does anyone know if this is possible? The idea behind this is that
>     slower machines should not delay the calculations on faster machines
>     when a job is distributed among them, so that all machines work at
>     more or less the same pace.
>
>     I've been told that it is NOT an option to create different
>     partitions, each containing only one type of machine.
>
>     Please note that I'm not looking for a way for a user to choose
>     which nodes to use for a job; what I need is for Slurm to do that
>     and decide which nodes to use, preferring similar nodes if available.
>
>     The closest thing I could find in the manual was consumable
>     resources, but I don't think this is what I need. There are several
>     examples, but none of them seems to fit this case.
>
>     Thank you for your help!
>
