[slurm-users] Core reserved/bound to a GPU
Manuel Bertrand
Manuel.Bertrand at lis-lab.fr
Fri Sep 4 13:08:59 UTC 2020
On 01/09/2020 06:36, Chris Samuel wrote:
> On Monday, 31 August 2020 7:41:13 AM PDT Manuel BERTRAND wrote:
>
>> Everything works great so far, but now I would like to bind a specific
>> core to each GPU on each node. By "bind" I mean to make a particular
>> core not assignable to a CPU-only job, so that the GPU remains usable
>> whatever the CPU workload on the node.
> What I've done in the past (waves to Swinburne folks on the list) was to have
> overlapping partitions on GPU nodes where the GPU job partition had access to
> all the cores and the CPU only job partition had access to only a subset
> (limited by the MaxCPUsPerNode parameter on the partition).
>
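For reference, a minimal sketch of what such overlapping partitions could look like in slurm.conf (the node name, core count, and partition names here are hypothetical, not taken from Chris's setup):

```
# Hypothetical slurm.conf excerpt: two partitions share the same GPU node.
# GPU jobs use "gpu" and may use all 12 cores; CPU-only jobs use
# "cpuongpu" and are capped at 10 cores, leaving 2 cores free for GPU jobs.
NodeName=gpunode1 CPUs=12 Gres=gpu:2 State=UNKNOWN
PartitionName=gpu      Nodes=gpunode1 MaxTime=168:00:00 State=UP
PartitionName=cpuongpu Nodes=gpunode1 MaxCPUsPerNode=10 MaxTime=168:00:00 State=UP
```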
Thanks for this suggestion, but it leads to another problem: the number
of cores differs quite a bit across the nodes, ranging from 12 to 20.
Since MaxCPUsPerNode is enforced on every node in the partition, I would
have to tune it for the GPU node with the fewest cores (here 12 cores
with 2 GPUs, so 2 cores to reserve: MaxCPUsPerNode=10), and I would
therefore lose up to 10 cores on the 20-core nodes :(
What do you think of enforcing this only on the "Default" partition
(GPU + CPU nodes), so that a user who needs the full set of cores must
specify a partition explicitly, i.e. "cpu" / "gpu"?
Here is my current partitions declaration:
PartitionName=cpu Nodes=cpunode1,cpunode2,cpunode3,cpunode4,cpunode5 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=gpu Nodes=gpunode1,gpunode2,gpunode3,gpunode4,gpunode5,gpunode6,gpunode7,gpunode8 Default=NO DefaultTime=60 MaxTime=168:00:00 State=UP
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP
So instead of enforcing the limit directly on the CPU partition and
adding all the GPU nodes to it, I would set it on the "Default" one
(here named "all"), like this:
PartitionName=all Nodes=ALL Default=YES DefaultTime=60 MaxTime=168:00:00 State=UP MaxCPUsPerNode=10
It seems quite hackish...
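For what it's worth, the user-facing side of that scheme might look like the following (the job script name and sizes are illustrative only; the partition names are the ones declared above):

```
# Default: the job lands in the "all" partition, capped at 10 cores per node.
sbatch -N1 -n10 job.sh

# A user who needs a full core set explicitly requests the "cpu" partition:
sbatch --partition=cpu -N1 -n20 job.sh

# A GPU job goes to the "gpu" partition, where no per-node core cap applies:
sbatch --partition=gpu --gres=gpu:1 job.sh
```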