[slurm-users] Reserve some cores per GPU

Relu Patrascu relu at cs.toronto.edu
Tue Oct 20 21:47:52 UTC 2020


I thought of doing this, but I'm guessing you don't have preemption
enabled. With preemption enabled this becomes more complicated and error
prone, but I'll think some more about it. It'd be nice to leverage Slurm's
scheduling engine and just add this constraint.
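
Something along these lines is roughly what I had in mind for the kind of
job_submit.lua filter you describe. It's only an untested sketch: the
weak_nodes table and its thresholds are invented, and the exact job_desc
field names (gres vs. tres_per_node, NO_VAL sentinels for unset values)
differ between Slurm versions.

    -- Map of "weak" node names to the maximum CPU cores allowed per GPU.
    -- These names and numbers are invented for illustration.
    local weak_nodes = {
       ["gpu-weak01"] = 2,
       ["gpu-weak02"] = 2,
    }

    -- Count the GPUs a job asks for by parsing its gres/tres string.
    local function gpus_requested(job_desc)
       local gres = job_desc.gres or job_desc.tres_per_node or ""
       local n = string.match(gres, "gpu.-(%d+)")
       if n == nil and string.find(gres, "gpu") then
          return 1          -- "gpu" with no count means a single GPU
       end
       return tonumber(n) or 0
    end

    function slurm_job_submit(job_desc, part_list, submit_uid)
       local gpus = gpus_requested(job_desc)
       if gpus > 0 then
          -- Unset numeric fields can arrive as NO_VAL sentinels rather
          -- than nil; a real script has to account for that.
          local cpus = (job_desc.cpus_per_task or 1) *
                       (job_desc.num_tasks or 1)
          local cpus_per_gpu = cpus / gpus
          -- Exclude any weak node whose per-GPU core budget the job exceeds.
          for node, max_per_gpu in pairs(weak_nodes) do
             if cpus_per_gpu > max_per_gpu then
                if job_desc.exc_nodes == nil or job_desc.exc_nodes == "" then
                   job_desc.exc_nodes = node
                else
                   job_desc.exc_nodes = job_desc.exc_nodes .. "," .. node
                end
             end
          end
       end
       return slurm.SUCCESS
    end

    function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
       return slurm.SUCCESS
    end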

Relu

On 2020-10-20 16:20, Aaron Jackson wrote:
> I look after a very heterogeneous GPU Slurm setup, and some nodes have
> rather few cores. We use a job_submit lua script which calculates the
> number of requested CPU cores per GPU. This is then checked against a
> table of 'weak nodes', each with a 'max cores per GPU' property, and the
> names of any nodes that can't satisfy that ratio are appended to the
> job_desc exc_nodes property.
>
> It's not particularly elegant but it does work quite well for us.
>
> Aaron
>
>
> On 20 October 2020 at 18:17 BST, Relu Patrascu wrote:
>
>> Hi all,
>>
>> We have a GPU cluster and have run into this issue occasionally. Assume
>> four GPUs per node: when a user requests one GPU on such a node along
>> with all the cores, or all the RAM, the other three GPUs are wasted for
>> the duration of the job, because Slurm has no cores or RAM left to
>> allocate alongside those GPUs for subsequent jobs.
>>
>>
>> We have a "soft" solution, but it's not ideal: we assign large
>> TRESBillingWeights to CPU consumption, which discourages users from
>> allocating many CPUs.
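>>
>> In slurm.conf that looks roughly like the following (the weights here
>> are illustrative, not our real numbers, and an actual partition line
>> carries more parameters):
>>
>>    PartitionName=gpu TRESBillingWeights="CPU=4.0,Mem=0.25G,GRES/gpu=1.0"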
>>
>>
>> Ideal for us would be the ability to define, for each GPU, a number of
>> CPUs that always remains available on the node. A similar feature for
>> an amount of RAM would also help.
>>
>>
>> Take for example a node that has:
>>
>> * four GPUs
>>
>> * 16 CPUs
>>
>>
>> Let's assume that most jobs would work just fine with a minimum of
>> 2 CPUs per GPU. Then we could set, in the node definition, a variable
>> such as
>>
>>    CpusReservedPerGpu = 2
>>
>> The first job to run on this node could then get between 2 and 10 CPUs,
>> leaving 6 CPUs reserved for potential incoming jobs (2 for each of the
>> other three GPUs).
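>>
>> In the node definition that might look something like this (hypothetical
>> syntax, since no such parameter exists today):
>>
>>    NodeName=gpu01 CPUs=16 Gres=gpu:4 CpusReservedPerGpu=2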
>>
>>
>> We couldn't find a way to do this; are we missing something? We'd rather
>> not modify the source code again :/
>>
>> Regards,
>>
>> Relu
>


