[slurm-users] Keep CPU Jobs Off GPU Nodes
René Sitt
sittr at hrz.uni-marburg.de
Wed Mar 29 08:08:12 UTC 2023
Hello,
maybe some additional notes:
While the cited procedure works great in general, it gets more
complicated for heterogeneous setups, i.e. if you have several GPU types
defined in gres.conf, since the 'tres_per_<x>' fields can then take the
form of either 'gres:gpu:N' or 'gres:gpu:<type>:N' - depending on
whether the job script specifies a GPU type or not.
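For illustration, a small helper along these lines could extract the
count from either form (a sketch only; the pattern makes assumptions
about which characters a GPU type name may contain):

    -- tres is nil, "gres:gpu:N", or "gres:gpu:<type>:N"
    local function parse_gpu_count(tres)
        if tres == nil then return 0 end
        -- typed form first: gres:gpu:<type>:N
        local n = string.match(tres, "^gres:gpu:[%w_%-]+:(%d+)$")
        -- then the untyped form: gres:gpu:N
        n = n or string.match(tres, "^gres:gpu:(%d+)$")
        return tonumber(n) or 0
    end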
Of course, you could omit the GPU type definition in gres.conf and
define the type as a node feature instead, as long as no nodes contain
multiple different GPU types.
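As a sketch of that alternative (node names, GPU counts and the feature
name are just examples): keep the gres.conf entry untyped and advertise
the model as a node feature in slurm.conf, so jobs select it via
--constraint instead of a typed gres:

    # gres.conf: untyped GPU definition
    NodeName=gpu[01-04] Name=gpu File=/dev/nvidia[0-3]

    # slurm.conf: model exposed as a node feature
    # (plus the usual CPU/memory parameters)
    NodeName=gpu[01-04] Gres=gpu:4 Features=a100

    # a job script would then request, e.g.
    #   #SBATCH --gres=gpu:2
    #   #SBATCH --constraint=a100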
Since some of our nodes do contain multiple different GPU types, I
instead opted to check only for the presence of 'gpu' in the
'tres_per_<x>' fields and not to bother with parsing the actual number
of GPUs. There is one interesting edge case here, though: users are
free to set --gpus=0, so one either has to filter for that specifically
or instruct one's users not to do it.
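A sketch of that kind of check (field names as in job_desc; the zero
filter covers the --gpus=0 case) could look like this:

    local function job_requests_gpu(job_desc)
        -- fields that can carry a GPU request
        local fields = { 'tres_per_node', 'tres_per_task',
                         'tres_per_socket', 'tres_per_job' }
        for _, f in ipairs(fields) do
            local spec = tostring(job_desc[f])
            -- covers both "gres:gpu:N" and "gres:gpu:<type>:N",
            -- but skips explicit zero requests such as --gpus=0
            if string.find(spec, "gpu") and
               not string.match(spec, ":0+$") then
                return true
            end
        end
        return false
    end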
Kind Regards,
René Sitt
Am 29.03.23 um 08:57 schrieb Ward Poelmans:
> Hi,
>
> We have dedicated partitions for GPUs (their names end with _gpu)
> and simply forbid a job that is not requesting GPU resources from
> using these partitions:
>
> local function job_total_gpus(job_desc)
>     -- return total number of GPUs allocated to the job
>     -- there are many ways to request a GPU; this comes from the
>     -- job_submit example in the slurm source
>     -- a GPU resource is either nil or "gres:gpu:N", with N the number
>     -- of GPUs requested
>
>     -- pick relevant job resources for GPU spec (undefined resources
>     -- can show limit values)
>     local gpu_specs = {
>         ['tres_per_node']   = 1,
>         ['tres_per_task']   = 1,
>         ['tres_per_socket'] = 1,
>         ['tres_per_job']    = 1,
>     }
>
>     -- number of nodes
>     if job_desc['min_nodes'] < 0xFFFFFFFE then
>         gpu_specs['tres_per_node'] = job_desc['min_nodes']
>     end
>     -- number of tasks
>     if job_desc['num_tasks'] < 0xFFFFFFFE then
>         gpu_specs['tres_per_task'] = job_desc['num_tasks']
>     end
>     -- number of sockets
>     if job_desc['sockets_per_node'] < 0xFFFE then
>         gpu_specs['tres_per_socket'] = job_desc['sockets_per_node']
>     end
>     gpu_specs['tres_per_socket'] = gpu_specs['tres_per_socket'] *
>                                    gpu_specs['tres_per_node']
>
>     local gpu_options = {}
>     for tres_name, _ in pairs(gpu_specs) do
>         local num_gpus = string.match(tostring(job_desc[tres_name]),
>                                       "^gres:gpu:([0-9]+)") or 0
>         gpu_options[tres_name] = tonumber(num_gpus)
>     end
>     -- calculate total GPUs
>     for tres_name, job_res in pairs(gpu_specs) do
>         local num_gpus = gpu_options[tres_name]
>         if num_gpus > 0 then
>             local total_gpus = num_gpus * tonumber(job_res)
>             return total_gpus
>         end
>     end
>     return 0
> end
>
>
>
> function slurm_job_submit(job_desc, part_list, submit_uid)
>     local total_gpus = job_total_gpus(job_desc)
>     slurm.log_debug("Job total number of GPUs: %s", tostring(total_gpus))
>
>     if total_gpus == 0 then
>         for partition in string.gmatch(tostring(job_desc.partition), '([^,]+)') do
>             if string.match(partition, '_gpu$') then
>                 slurm.log_user(string.format(
>                     'ERROR: GPU partition %s is not allowed for non-GPU jobs.',
>                     partition))
>                 return ESLURM_INVALID_GRES
>             end
>         end
>     end
>
>     return slurm.SUCCESS
> end
>
>
>
> Ward
>
> On 29/03/2023 01:24, Frank Pari wrote:
>> Well, I wanted to avoid using lua. But it looks like that's going
>> to be the easiest way to do this without having to create a separate
>> partition for the GPUs. Basically: check for at least one GPU in the
>> job submission and, if none is requested, exclude all GPU nodes for
>> the job.
>>
>> [inline screenshot of the job_submit.lua snippet]
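>> Roughly, the snippet does something like this (a sketch; the node
>> list is a placeholder, and keeping non-GPU jobs off those nodes by
>> extending exc_nodes is one way to do it):
>>
>>     function slurm_job_submit(job_desc, part_list, submit_uid)
>>         -- nodes that carry GPUs; currently maintained by hand
>>         local gpu_nodes = 'gpu[01-08]'
>>         -- treat any gres:gpu request in the usual tres fields as a GPU job
>>         local wants_gpu = false
>>         for _, f in ipairs({ 'tres_per_node', 'tres_per_task',
>>                              'tres_per_socket', 'tres_per_job' }) do
>>             if string.find(tostring(job_desc[f]), 'gpu') then
>>                 wants_gpu = true
>>             end
>>         end
>>         -- no GPU requested: exclude the GPU nodes for this job
>>         if not wants_gpu then
>>             if job_desc.exc_nodes == nil or job_desc.exc_nodes == '' then
>>                 job_desc.exc_nodes = gpu_nodes
>>             else
>>                 job_desc.exc_nodes = job_desc.exc_nodes .. ',' .. gpu_nodes
>>             end
>>         end
>>         return slurm.SUCCESS
>>     end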
>>
>> Now I'm wondering how to auto-generate the list of nodes with GPUs,
>> so I don't have to remember to update job_submit.lua every time we
>> get new GPU nodes.
>>
>> -F
>>
>> On Tue, Mar 28, 2023 at 4:06 PM Frank Pari <parif at bc.edu> wrote:
>>
>> Hi all,
>>
>> First, thank you all for participating in this list. I've learned
>> so much just by following others' threads. =)
>>
>> I'm looking at creating a scavenger partition with idle resources
>> from CPU and GPU nodes, and I'd like to keep this to one partition.
>> But I don't want CPU-only jobs using up resources on the GPU nodes.
>>
>> I've seen suggestions for job_submit/Lua scripts. But I'm wondering
>> if there's any other way to ensure a job has requested at least one
>> GPU before the scheduler assigns that job to a GPU node.
>>
>> Thanks in advance!
>>
>> -Frank
>>
>
--
Dipl.-Chem. René Sitt
Hessisches Kompetenzzentrum für Hochleistungsrechnen
Philipps-Universität Marburg
Hans-Meerwein-Straße
35032 Marburg
Tel. +49 6421 28 23523
sittr at hrz.uni-marburg.de
www.hkhlr.de