[slurm-users] Distribute jobs in similar nodes in the same partition
Antonio Lara
antonio.lara at uam.es
Fri May 11 09:49:52 MDT 2018
Thank you all for your answers. I will look further into these suggestions!
Any other opinions are welcome.
Regards,
Antonio
On 11/05/18 at 16:05, Vicker, Darby (JSC-EG311) wrote:
>
> I’ll second that – we have a cluster with 4 generations of nodes. We
> assign a processor type feature to each node and require the users to
> ask for at least one of those features in their jobs via
> job_submit.lua – see the code below. For a job that can run on any
> processor type, you can use this:
>
> #SBATCH --constraint=[wes|san|has|bro]
>
> See the constraint section of “man sbatch” for more details, but this
> lets the job run on any processor type while requiring that all of
> its nodes be of one type. It works really well from a utilization
> standpoint – jobs will run on the first processor type that is free.
>
> local feature_count = 0
>
> -- Count how many of the known processor-type features the job requested.
> if job_desc ~= nil and job_desc.features ~= nil then
>    if string.match(job_desc.features, "wes") then feature_count = feature_count + 1 end
>    if string.match(job_desc.features, "san") then feature_count = feature_count + 1 end
>    if string.match(job_desc.features, "has") then feature_count = feature_count + 1 end
>    if string.match(job_desc.features, "bro") then feature_count = feature_count + 1 end
> end
>
> if feature_count > 0 then
>    slurm.log_info("Found %s valid cpu features", feature_count)
> else
>    -- Tell the user why the job was rejected.
>    slurm.log_user("Invalid features - aerolab policy requires specifying one or more of wes,san,has,bro.")
>    -- submit_uid is the third argument of slurm_job_submit().
>    slurm.log_error("Found %s cpu features from %s", feature_count, submit_uid)
>    -- See slurm/slurm_errno.h and src/common/slurm_errno.c
>    -- for the list of error codes and messages.
>    return 2002
> end
>
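The substring checks in the snippet above would also count a feature named, say, "hash" or "sandy" as a match. Here is a minimal sketch of a stricter count, assuming the same four feature names. The helper name count_cpu_features and the table CPU_FEATURES are hypothetical and not part of Slurm; the idea is to split the constraint string into whole tokens before comparing:

```lua
-- Hypothetical lookup table of the site's processor-type features.
local CPU_FEATURES = { wes = true, san = true, has = true, bro = true }

-- Hypothetical helper (not a Slurm API): counts how many whole tokens
-- in a constraint string are known processor-type features.
local function count_cpu_features(features)
  local count = 0
  if type(features) ~= "string" then
    return count
  end
  -- Tokens are runs of letters, digits, "_" or "-"; operators such as
  -- "|", "&", "[", "]" and "*" act as separators.
  for token in features:gmatch("[%w_%-]+") do
    if CPU_FEATURES[token] then
      count = count + 1
    end
  end
  return count
end
```

With this, count_cpu_features(job_desc.features) could stand in for the four string.match checks, leaving the rest of the logic unchanged.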
> Of course, the user can leave off the square brackets and get any mix
> of processor types. We have some codes that run fine across different
> processor types so we allow this. Our group is small enough that we
> can easily educate and police the users to do the right thing. But
> you could add more logic to job_submit.lua to require the brackets if
> you wanted to.
>
> Darby
>
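As for requiring the brackets, one possible sketch of that extra logic follows. It assumes the policy is "the whole constraint must be a single bracketed list of the four processor types"; the helper name uses_bracketed_features and the table VALID_FEATURES are hypothetical, and the check deliberately rejects other valid Slurm constraint forms (node counts, "&" combinations), so it may be too strict for some sites:

```lua
-- Hypothetical lookup table of the site's processor-type features.
local VALID_FEATURES = { wes = true, san = true, has = true, bro = true }

-- Hypothetical helper (not a Slurm API): returns true only when the
-- constraint string has the form "[a|b|c]" and every alternative is
-- one of the known processor-type features.
local function uses_bracketed_features(features)
  if type(features) ~= "string" then
    return false
  end
  -- Capture what sits between one leading "[" and one trailing "]".
  local inner = features:match("^%[([^%[%]]+)%]$")
  if inner == nil then
    return false
  end
  -- Reject empty alternatives such as "[wes||san]" or "[|wes]".
  if inner:find("||", 1, true) or inner:sub(1, 1) == "|" or inner:sub(-1) == "|" then
    return false
  end
  for name in inner:gmatch("[^|]+") do
    if not VALID_FEATURES[name] then
      return false
    end
  end
  return true
end
```

Inside job_submit.lua, a job whose job_desc.features fails this check could be rejected the same way as in the snippet above, with a slurm.log_user() message and an error return code.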
> *From: *slurm-users <slurm-users-bounces at lists.schedmd.com> on behalf
> of Hadrian Djohari <hxd58 at case.edu>
> *Reply-To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Date: *Friday, May 11, 2018 at 5:22 AM
> *To: *Slurm User Community List <slurm-users at lists.schedmd.com>
> *Cc: *"slurm-users at schedmd.com" <slurm-users at schedmd.com>
> *Subject: *Re: [slurm-users] Distribute jobs in similar nodes in the
> same partition
>
> You can use node features in slurm.conf to define the node types.
>
> Then, when submitting the job, use -C <feature name> to use just
> that node type.
>
> On Fri, May 11, 2018, 5:38 AM Antonio Lara <antonio.lara at uam.es> wrote:
>
> Hello everyone,
>
> Hopefully someone can help me with this. I cannot find in the manual
> whether this is even possible.
>
> I'm a system administrator, and the following question is from the
> administrator's point of view, not the user's:
>
> I work with a cluster which has a partition containing many nodes.
> These nodes belong to "different categories". That is, we bought
> several machines of the same type at once, and we did this several
> times. So, for example, we have 10 machines of type A, 20 machines of
> type B and 15 machines of type C. Machines of type A are more powerful
> than machines of type B, which are more powerful than machines of
> type C.
>
> What I am trying to achieve is that Slurm "forces" parallelized jobs
> to be allocated on machines of the same type, if possible. That is,
> there should be some kind of priority which tries to allocate only
> machines of type A, or only machines of type B, or only of type C, and
> only distributes jobs among machines of different types when there are
> not enough nodes of the same type available.
>
> Does anyone know if this is possible? The idea behind this is that
> slower machines do not delay the calculations on faster machines
> when a job is distributed among them, and all machines work at more
> or less the same pace.
>
> I've been told that it is NOT an option to create different
> partitions, each containing only one type of machine.
>
> Please note that I'm not looking for a way for a user to choose which
> nodes to use for a job. What I need is for Slurm to do that and decide
> which nodes to use, using similar nodes if available.
>
> The closest thing I could find in the manual was consumable
> resources, but I don't think this is what I need. There are several
> examples, but they don't seem to fit this case.
>
> Thank you for your help!
>