<div dir="ltr">Hi,<div><br></div><div>When a user requests all of the GPUs on a system, but fewer than the total number of CPUs, the CPU bindings aren't ideal:</div><div><br></div><div><blockquote style="margin:0px 0px 0px 40px;border:none;padding:0px"><div><div>[root@host ~]# nvidia-smi topo -m</div><div><span style="white-space:pre">       </span>GPU0<span style="white-space:pre"> </span>GPU1<span style="white-space:pre"> </span>GPU2<span style="white-space:pre"> </span>GPU3<span style="white-space:pre"> </span>mlx5_3<span style="white-space:pre">       </span>mlx5_1<span style="white-space:pre">       </span>mlx5_2<span style="white-space:pre">       </span>mlx5_0<span style="white-space:pre">       </span>CPU Affinity</div><div>GPU0<span style="white-space:pre">      </span> X <span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22</div><div>GPU1<span style="white-space:pre">     </span>PHB<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>0-0,2-2,4-4,6-6,8-8,10-10,12-12,14-14,16-16,18-18,20-20,22-22</div><div>GPU2<span style="white-space:pre">     </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span 
style="white-space:pre">  </span>1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23</div><div>GPU3<span style="white-space:pre">     </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>1-1,3-3,5-5,7-7,9-9,11-11,13-13,15-15,17-17,19-19,21-21,23-23</div><div>mlx5_3<span style="white-space:pre">   </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PIX<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span></div><div>mlx5_1<span style="white-space:pre">        </span>PHB<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PIX<span style="white-space:pre">  </span></div><div>mlx5_2<span style="white-space:pre">        </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>PIX<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span> X <span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span></div><div>mlx5_0<span style="white-space:pre">        </span>PHB<span style="white-space:pre">  </span>PHB<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span>SYS<span 
style="white-space:pre">  </span>PIX<span style="white-space:pre">  </span>SYS<span style="white-space:pre">  </span> X </div></div><div><br></div><div><div>$ cat /usr/local/slurm/etc/gres.conf </div><div>NodeName=host Name=gpu Type=p100 File=/dev/nvidia0 Cores=0,2,4,6,8,10,12,14,16,18,20,22</div><div>NodeName=host Name=gpu Type=p100 File=/dev/nvidia1 Cores=0,2,4,6,8,10,12,14,16,18,20,22</div><div>NodeName=host Name=gpu Type=p100 File=/dev/nvidia2 Cores=1,3,5,7,9,11,13,15,17,19,21,23</div><div>NodeName=host Name=gpu Type=p100 File=/dev/nvidia3 Cores=1,3,5,7,9,11,13,15,17,19,21,23</div></div><div><br></div><div><div>[scrosby@thespian ~]$ sinteractive -n 20 --gres=gpu:p100:4</div><div>srun: job 612 queued and waiting for resources</div><div>srun: job 612 has been allocated resources</div><div>[scrosby@host ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_10255/job_612/cpuset.cpus </div><div>0-16,18,20,22</div></div><div><br></div></blockquote>That is all 12 even-numbered cores (the NUMA node local to GPU0 and GPU1) but only 8 of the odd-numbered cores. It should ideally be using CPUs 0-19 (split evenly across NUMA nodes).</div><div><br></div><div>I've tried forcing it with this:</div><div><br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><div>[scrosby@thespian ~]$ sinteractive -n 20 --gres=gpu:p100:4 --cpu_bind=map_cpu:0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19</div></div><div><div>srun: job 614 queued and waiting for resources</div></div><div><div>srun: job 614 has been allocated resources</div></div><div><br></div></blockquote>But the resulting CPU binding is still the same:<div><br><blockquote style="margin:0 0 0 40px;border:none;padding:0px"><div><div>[scrosby@host ~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_10255/job_614/cpuset.cpus </div></div><div><div>0-16,18,20,22</div></div></blockquote><div><br></div><div>Is there any way to force the CPU bindings of a particular job?</div><div><br></div><div>Cheers,<br>Sean</div></div></div>