[slurm-users] Job allocation from a heterogenous pool of nodes
Brian Andrus
toomuchit at gmail.com
Wed Dec 7 17:27:01 UTC 2022
You may want to look here:
https://slurm.schedmd.com/heterogeneous_jobs.html
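
For your example below, a heterogeneous job lets each component request a
different task count, so the EPYC node is not limited by the Xeon node.
A minimal sketch, assuming the partition and node names from your message
(untested):

$ salloc --partition=all --nodelist=gpu01 --ntasks-per-node=32 : \
         --partition=all --nodelist=gpu02 --ntasks-per-node=64

Inside the allocation you can then launch a step spanning both components,
e.g.:

$ srun --het-group=0,1 ./your_hpl_wrapper

where ./your_hpl_wrapper is a placeholder for your actual HPL container
launch command. If you simply need more tasks than physical cores on a
node, the -O/--overcommit option to salloc/srun may also be worth a look.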
Brian Andrus
On 12/7/2022 12:42 AM, Le, Viet Duc wrote:
>
> Dear slurm community,
>
>
> I am encountering a unique situation where I need to allocate jobs to
> nodes with different numbers of CPU cores. For instance:
>
> node01: Xeon 6226, 32 cores
>
> node02: EPYC 7543, 64 cores
>
>
> $ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=32 --comment=etc
>
> If --ntasks-per-node is larger than 32, the job cannot be allocated,
> since node01 has only 32 cores.
>
>
> In the context of NVIDIA's HPL container, we need to pin MPI
> processes according to NUMA affinity for best performance.
>
> For HGX-1, there are 8 A100s with affinity to the 1st, 3rd, 5th, and
> 7th NUMA domains.
>
> With --ntasks-per-node=32, only the first half of the EPYC's NUMA
> domains is available, and we had to assign the 4th-7th A100s to the
> 0th and 2nd NUMA domains, leading to some performance degradation.
>
>
> I am looking for a way to request more tasks than the number of
> physically available cores, i.e.
>
> $ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=64 --comment=etc
>
>
> Your suggestions are much appreciated.
>
>
> Regards,
>
> Viet-Duc
>