[slurm-users] How to avoid a feature?
tina.friedrich at it.ox.ac.uk
Fri Jul 2 11:42:19 UTC 2021
Mainly, we didn't want to have too many partitions, so we were after a
way to keep the GPU nodes from being separated out.
Partly it is because we wanted to be able to easily use 'idle' CPUs on
GPU nodes - although I currently only allow that on some of them (I
simply also tag them with 'cpu'). Having them in a separate partition
would mean users would have to change where they submit to, or I would
have to mess with that in the verifier...
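For illustration only, tagging of that kind might look something like this
in slurm.conf. The node names, GPU counts and exact feature names here are
made up for the example and are not our actual configuration:

```
# Hypothetical node definitions: every node carries a feature tag.
NodeName=cpu[001-100] Features=cpu
# GPU nodes whose idle CPUs may be used by CPU-only jobs get both tags.
NodeName=gpu[01-04] Gres=gpu:4 Features=gpu,cpu
# GPU nodes reserved for GPU jobs carry only the 'gpu' tag.
NodeName=gpu[05-08] Gres=gpu:4 Features=gpu
```

A job that ends up with the 'cpu' constraint can then land on the first two
groups but never on the third.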
Also, for various reasons, we'd end up with something like 10 or 12
partitions, which seemed a lot. We preferred keeping the GPU nodes in
with the rest and teaching users to specify their resources properly
(the GPUs are a very mixed bunch as well).
We did think about having 'hidden' GPU partitions instead of wrangling
it with features, but we couldn't see any benefit to that.
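As a rough sketch of the decision the submit verifier makes (described in
the quoted message further down): a real job_submit plugin would be written
in Lua, so the Python below only models the logic and is not the actual
script. The function name, the feature set and the simple '&' append are
all assumptions made for the example.

```python
import re

# Hypothetical tags; each site chooses its own feature names.
SPECIAL_FEATURES = {"cpu", "gpu", "knl"}
DEFAULT_FEATURE = "cpu"


def apply_default_feature(features, gres):
    """Return the constraint string a job should carry after verification.

    features: the job's --constraint string, or None if none was given
    gres:     the job's --gres string, or None if none was given
    """
    # A GPU gres request already restricts the job to GPU nodes,
    # so the constraint is left untouched.
    if gres:
        for item in gres.split(","):
            if "gpu" in re.split(r"[:/]", item.strip()):
                return features

    # Extract bare feature names from the constraint expression by
    # splitting on anything that is not a letter, digit or underscore.
    requested = set(re.split(r"[^A-Za-z0-9_]+", features or ""))
    if requested & SPECIAL_FEATURES:
        return features

    # No relevant feature requested: add the default one.
    if not features:
        return DEFAULT_FEATURE
    # Assumption: a plain AND-append is enough. Constraint expressions
    # that use OR or node counts would need more careful handling.
    return f"{features}&{DEFAULT_FEATURE}"
```

With this logic, a job submitted with no constraint ends up with 'cpu', a
job asking for --constraint=gpu or --gres=gpu:2 is left alone, and a job
asking for some unrelated feature gets '&cpu' appended.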
On 02/07/2021 06:48, Loris Bennett wrote:
> Hi Tina,
> Tina Friedrich <tina.friedrich at it.ox.ac.uk> writes:
>> Hi Brian,
>> sometimes it would be nice if SLURM had what Grid Engine calls a 'forced
>> complex' (i.e. a feature that you *have* to request to land on a node that has
>> it), wouldn't it?
>> I do something like that for all of my 'special' nodes (GPU, KNL, nodes...) - I
>> want to avoid jobs not requesting that resource or allowing that architecture
>> landing on it. I 'tag' all nodes with a relevant feature (cpu, gpu, knl, ...),
>> and have a LUA submit verifier that checks for a 'relevant' feature (or a
>> --gres=gpu or something) and if there isn't one I add the 'cpu' feature to the
>> Works for us!
> We just have the GPU nodes in a separate partition 'gpu' which users
> have to specify if they want a GPU. How does that approach differ from
> yours in terms of functionality for you (or the users)?
> The main problem with our approach is that the CPUs on the GPU nodes can
> remain idle while there is a queue for the regular CPU nodes. What I
> would like is to allow short CPU-only jobs to run on the GPUs but only
> allow GPU-jobs to run for longer, which I guess I could probably do
> within the submit plugin.
>> On 01/07/2021 15:08, Brian Andrus wrote:
>>> I have a partition where one of the nodes has a node-locked license.
>>> That license is not used by everyone that uses the partition.
>>> They are cloud nodes, so weights do not work (there is an open bug about
>>> I need to have jobs 'avoid' that node by default. I am thinking I can use a
>>> feature constraint, but that seems to only apply to those that want the
>>> feature. Since we have so many other users, it isn't feasible to have them
>>> modify their scripts, so having it avoid by default would work.
>>> Any ideas how to do that? Submit LUA perhaps?
>>> Brian Andrus
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford