[slurm-users] Help with binding GPUs to sockets (NVlink, P2P)

Daniel Vecerka vecerka at fel.cvut.cz
Fri Jun 28 07:27:05 UTC 2019


I'm not sure how it works in 19.05, but with 18.x it's possible to 
specify CPU affinity per GPU in the file /etc/slurm/gres.conf:
Name=gpu Type=v100 File=/dev/nvidia0 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia1 CPUs=0-17,36-53
Name=gpu Type=v100 File=/dev/nvidia2 CPUs=18-35,54-71
Name=gpu Type=v100 File=/dev/nvidia3 CPUs=18-35,54-71

You can get the CPU numbers for each GPU with the command:

nvidia-smi topo -m
         GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     NODE    0-17,36-53
GPU1    NV2      X      NV2     NV2     NODE    0-17,36-53
GPU2    NV2     NV2      X      NV2     SYS     18-35,54-71
GPU3    NV2     NV2     NV2      X      SYS     18-35,54-71
mlx5_0  NODE    NODE    SYS     SYS      X
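
If you have many nodes, the gres.conf lines can be generated from the topology output instead of typed by hand. The following is only a rough sketch, not a tested tool: the function name gres_lines_from_topo and the gpu_type parameter are made up for illustration, and it assumes that the GPU index shown by nvidia-smi matches the /dev/nvidiaN device number, which you should verify on your own nodes.

```python
import re

# Sample `nvidia-smi topo -m` output, copied from the node shown above.
SAMPLE = """\
         GPU0    GPU1    GPU2    GPU3    mlx5_0  CPU Affinity
GPU0     X      NV2     NV2     NV2     NODE    0-17,36-53
GPU1    NV2      X      NV2     NV2     NODE    0-17,36-53
GPU2    NV2     NV2      X      NV2     SYS     18-35,54-71
GPU3    NV2     NV2     NV2      X      SYS     18-35,54-71
mlx5_0  NODE    NODE    SYS     SYS      X
"""

# A CPU list looks like "0-17,36-53" or "4" or "0,2,4-7".
CPU_LIST = re.compile(r"\d+(-\d+)?(,\d+(-\d+)?)*")


def gres_lines_from_topo(topo_text, gpu_type="v100"):
    """Build gres.conf lines from the text of `nvidia-smi topo -m`.

    Assumption: GPU<n> in the matrix corresponds to /dev/nvidia<n>.
    """
    n_devices = None  # number of device columns in the matrix
    lines = []
    for row in topo_text.splitlines():
        fields = row.split()
        if not fields:
            continue
        # Header row: device names followed by "CPU Affinity".
        if "CPU" in fields and "Affinity" in fields:
            n_devices = fields.index("CPU")
            continue
        match = re.fullmatch(r"GPU(\d+)", fields[0])
        if match is None or n_devices is None:
            continue  # not a GPU row (e.g. the mlx5_0 row), or no header yet
        if len(fields) <= n_devices + 1:
            continue  # row has no CPU affinity column
        # Row layout: name, one entry per device, then the CPU affinity.
        cpus = fields[n_devices + 1]
        if not CPU_LIST.fullmatch(cpus):
            continue
        lines.append(
            f"Name=gpu Type={gpu_type} File=/dev/nvidia{match.group(1)} CPUs={cpus}"
        )
    return lines


if __name__ == "__main__":
    for line in gres_lines_from_topo(SAMPLE):
        print(line)
```

Run against the output above, this should print the same four lines as in my gres.conf. On a real node you would feed it the output of the nvidia-smi command instead of the SAMPLE string.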

Best regards, Daniel

On 27.06.2019 15:17, Luis Altenkort wrote:
> Hello everyone,
> I have several nodes with 2 sockets each and 4 GPUs per Socket (i.e. 8 
> GPUs per node). I now want to tell SLURM that GPUs with device ID 
> 0,1,2,3 are connected to socket 0 and GPUs 4,5,6,7 are connected to 
> socket 1. I want to do this in order to be able to use the new command 
> --gpus-per-socket. All GPUs on one socket are directly linked via 
> NVlink and use P2P communication. In the end I want to be able to run 
> multi-gpu jobs on GPUs that are all on one socket (and not distributed 
> across sockets or nodes). What do I have to change in my slurm.conf 
> and how would I submit jobs? Like this?:
> #!/bin/bash
> #SBATCH --job-name=test
> #SBATCH --partition=volta
> #SBATCH --gpus-per-socket=4
> #SBATCH --ntasks=1
> #SBATCH --cpus-per-task=1
> Slurm is on version 19.05.0.
> Thanks in advance!
