[slurm-users] Job allocation from a heterogenous pool of nodes
Brian Andrus
toomuchit at gmail.com
Wed Dec 7 17:27:01 UTC 2022
You may want to look here:
https://slurm.schedmd.com/heterogeneous_jobs.html
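
For your example below, a heterogeneous job lets each component request a
different task count, so the EPYC node is not limited by the Xeon node.
A minimal sketch, assuming the partition and node names from your message
(untested):

$ salloc --partition=all --nodelist=gpu01 --ntasks-per-node=32 : \
         --partition=all --nodelist=gpu02 --ntasks-per-node=64

Inside the allocation you can then launch a step spanning both components,
e.g.:

$ srun --het-group=0,1 ./your_hpl_wrapper

where ./your_hpl_wrapper is a placeholder for your actual HPL container
launch command. If you simply need more tasks than physical cores on a
node, the -O/--overcommit option to salloc/srun may also be worth a look.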
Brian Andrus
On 12/7/2022 12:42 AM, Le, Viet Duc wrote:
>
> Dear slurm community,
>
>
> I am encountering a unique situation where I need to allocate jobs to
> nodes with different numbers of CPU cores. For instance:
>
> node01: Xeon 6226, 32 cores
>
> node02: EPYC 7543, 64 cores
>
>
> $ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=32 --comment=etc
>
> If --ntasks-per-node is larger than 32, the job cannot be allocated,
> since node01 has only 32 cores.
>
>
> In the context of NVIDIA's HPL container, we need to pin MPI
> processes according to NUMA affinity for best performance.
>
> For HGX-1, there are 8 A100s with affinity to the 1st, 3rd, 5th, and
> 7th NUMA domains.
>
> With --ntasks-per-node=32, only the first half of the EPYC's NUMA
> domains is available, and we had to assign the 4th-7th A100s to the
> 0th and 2nd NUMA domains, leading to some performance degradation.
>
>
> I am looking for a way to request more tasks than the number of
> physically available cores, i.e.
>
> $ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=64 --comment=etc
>
>
> Your suggestions are much appreciated.
>
>
> Regards,
>
> Viet-Duc
>