[slurm-users] [External] Submitting to multiple partitions problem with gres specified

Prentice Bisbal pbisbal at pppl.gov
Fri Mar 12 21:30:17 UTC 2021


On 3/9/21 3:16 AM, Ward Poelmans wrote:

> Hi Prentice,
>
> On 8/03/2021 22:02, Prentice Bisbal wrote:
>
>> I have a very heterogeneous cluster with several different generations of
>> AMD and Intel processors, and we use this method quite effectively.
> Could you elaborate a bit more on how you manage that? Do you force your
> users to pick a feature? What if a user submits a multi-node job, can
> you make sure it will not start on a mix of avx512 and avx2 nodes?

I don't force the users to pick a feature, and to make matters worse, I 
think our login nodes are newer than some of the compute nodes, so it's 
entirely possible that if someone really optimizes their code for one of 
the login nodes, their job could get assigned to a node that doesn't 
understand the instruction set, resulting in the dreaded "Illegal 
Instruction" error. Suprisingly, this has only happened a few times in 
the 5 years I've been at this job.

I assume most users would want to use the newest and fastest processors 
if given the choice, so I set the priority weighting of the nodes so 
that the newest nodes are highest priority, and the oldest nodes the 
lowest priority.
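
In slurm.conf terms, that's the Weight parameter on the node definitions: 
Slurm allocates the lowest-weight nodes that satisfy a request first, so the 
newest hardware gets the smallest numbers. A rough sketch (node names and 
weight values here are made up, not our actual config):

    NodeName=epyc[001-010]    Weight=10   # newest, preferred
    NodeName=opteron[001-010] Weight=100  # oldest, used last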

The only way to make sure a job sticks to a certain instruction set is to 
have users specify the processor model, rather than the instruction-set 
family. For example:

-C 7281 will get you only AMD EPYC 7281 processors

and

-C 6376 will get you only AMD Opteron 6376 processors
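
Those constraints only work, of course, if the matching features are defined 
on the nodes. A minimal sketch in slurm.conf (node names and feature lists 
are hypothetical):

    NodeName=epyc[001-004]    Feature=epyc,7281,avx2
    NodeName=opteron[001-004] Feature=opteron,6376

and then a user submits with something like:

    sbatch -C 7281 myjob.sh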

Using your example, if you don't want to mix AVX2 and AVX512 processors 
in the same job ever, you can "lie" to Slurm in your topology file and 
come up with a topology where the two subsets of nodes can't talk to 
each other. That way, Slurm will not mix nodes of the different 
instruction sets. The problem with this is that it's a "permanent" 
solution - it's not flexible. I would imagine there are times when you 
would want to use both your AVX2 and AVX512 processors in a single job.
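
As an illustrative sketch (switch and node names are invented), the 
topology.conf for that trick would just define two switches with nothing 
joining them:

    SwitchName=avx512only Nodes=intel[001-016]
    SwitchName=avx2only   Nodes=amd[001-016]

With the topology/tree plugin in use, Slurm then won't place a single job 
across the two groups.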

I do something like this because we have 10 nodes set aside for serial 
jobs that are connected only by 1 GbE. We obviously don't want internode 
jobs running there, so in my topology file, each of those nodes has its 
own switch that's not connected to any other switch.
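
As a sketch with made-up names, that part of my topology file looks 
something like this:

    SwitchName=serial01sw Nodes=serial01
    SwitchName=serial02sw Nodes=serial02
    # ... one single-node switch per serial node ...
    SwitchName=ibswitch   Nodes=compute[001-064]

Each serial node sits behind its own switch with no connection to the rest 
of the fabric, so multi-node jobs never land on them.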

>
>> If you want to continue down the road you've already started on, can you
>> provide more information, like the partition definitions and the gres
>> definitions? In general, Slurm should support submitting to multiple
>> partitions.
> As far as I understood it, you can give a comma-separated list of
> partitions to sbatch, but it's not possible to do this by default?


Incorrect. Giving a comma-separated list is possible and works by default 
in Slurm; no extra configuration is needed. From the sbatch documentation 
(emphasis added to the relevant sentence):

> *-p*, *--partition*=<partition_names>
>     Request a specific partition for the resource allocation. If not
>     specified, the default behavior is to allow the slurm controller
>     to select the default partition as designated by the system
>     administrator. *If the job can use more than one partition,
>     specify their names in a comma separate list and the one offering
>     earliest initiation will be used with no regard given to the
>     partition name ordering (although higher priority partitions will
>     be considered first).* When the job is initiated, the name of the
>     partition used will be placed first in the job record partition
>     string.
>
You can't have a job *span* multiple partitions, but I don't think that 
was ever your goal.
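
For example, with placeholder partition names:

    sbatch -p general,medium,debug myjob.sh

The job is considered for all three partitions and runs in whichever one 
can start it earliest.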


Prentice
