[slurm-users] Core reserved/bound to a GPU

Manuel Bertrand Manuel.Bertrand at lis-lab.fr
Fri Sep 4 13:08:59 UTC 2020


On 01/09/2020 06:36, Chris Samuel wrote:
> On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:
>
>> Everything works great so far, but now I would like to bind a specific
>> core to each GPU on each node. By "bind" I mean to make a particular
>> core not assignable to a CPU-only job, so that the GPU stays available
>> whatever the CPU workload on the node.
> What I've done in the past (waves to Swinburne folks on the list) was to have
> overlapping partitions on GPU nodes, where the GPU job partition had access to
> all the cores and the CPU-only job partition had access to only a subset
> (limited by the MaxCPUsPerNode parameter on the partition).
>

Thanks for this suggestion, but it leads to another problem: the total 
number of cores varies quite a bit across the nodes, from 12 to 20.
Since MaxCPUsPerNode is enforced on every node in the partition, I have 
to size it for the GPU node with the fewest cores (here 12 cores and 
2 GPUs, so 2 cores to reserve: MaxCPUsPerNode=10), which means CPU-only 
jobs lose access to up to 10 cores on the 20-core nodes :(
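
Concretely, the overlapping scheme would have to look something like 
this on my side (just a sketch; I am reusing the node names from my 
config below, with hostlist ranges as shorthand):

PartitionName=gpu Nodes=gpunode[1-8] Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
# CPU-only partition overlaps the GPU nodes, but MaxCPUsPerNode caps
# jobs from this partition at 10 cores on any of its nodes:
PartitionName=cpu Nodes=cpunode[1-5],gpunode[1-8] Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10

On a 12-core gpunode that reserves exactly the 2 cores I want, but on a 
20-core gpunode it keeps 10 cores away from CPU-only jobs instead of 2.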

What do you think of enforcing this only on the "Default" partition 
(GPU + CPU nodes), so that a user who needs a node's full core set must 
specify the partition explicitly, i.e. "cpu" / "gpu"?
Here is my current partition declaration:
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP

So instead of enforcing the limit directly on the CPU partition and 
adding all the GPU nodes to it, I would set it on the "Default" one 
(here named "all") like this:
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10
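
With that in place, a user who needs a node's full core set has to name 
a partition explicitly, for example (script name and task count are 
made up):

sbatch --partition=cpu --ntasks=16 my_job.sh

while a job submitted without --partition lands in "all" and is capped 
at 10 cores per node.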

It seems quite hackish...
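
For what it's worth, once slurmctld has reread the config (scontrol 
reconfigure), the limit can at least be verified per partition with 
something like:

scontrol show partition | grep MaxCPUsPerNode

which I would expect to report MaxCPUsPerNode=10 for "all" and 
UNLIMITED for "cpu" and "gpu".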



