<div dir="ltr"><div dir="ltr">Thanks Julie! Figured I was missing something.</div><div dir="ltr"><br></div><div>-Randy</div></div><br><div class="gmail_quote"><div dir="ltr">On Mon, Sep 17, 2018 at 8:52 PM Julie Bernauer <<a href="mailto:jbernauer@nvidia.com">jbernauer@nvidia.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div dir="ltr">
<div id="m_-4431268156887288663divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif" dir="ltr">
<p style="margin-top:0;margin-bottom:0"></p>
<div>Hi Randy, <br>
<br>
This is expected on an HT machine, like on the one described below. If you run lstopo, you see:<br>
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5<br>
PU L#10 (P#5)<br>
PU L#11 (P#45)<br>
Slurm uses the logical cores so 10 and 11 gives you "physical" cores 5 and 45.<br>
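For a quick cross-check of that mapping, util-linux's lscpu prints the same thread-to-core table directly (a sketch, assuming lscpu is available on the node):

$ lscpu --extended=CPU,CORE,SOCKET

CPUs that share a CORE value are hyperthread siblings, so CPUs 5 and 45 should show the same core id in that listing.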
Julie
<div id="m_-4431268156887288663divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" color="#000000" face="Calibri, sans-serif"><b>From:</b> slurm-users <<a href="mailto:slurm-users-bounces@lists.schedmd.com" target="_blank">slurm-users-bounces@lists.schedmd.com</a>> on behalf of Randall Radmer <<a href="mailto:radmer@gmail.com" target="_blank">radmer@gmail.com</a>><br>
<b>Sent:</b> Wednesday, September 12, 2018 10:14 AM<br>
<b>To:</b> <a href="mailto:slurm-users@lists.schedmd.com" target="_blank">slurm-users@lists.schedmd.com</a><br>
<b>Subject:</b> [slurm-users] Using GRES to manage GPUs, but unable to assign specific CPUs to specific GPUs</font>
I'm using GRES to manage eight GPUs in a node on a new Slurm cluster, and I'm trying to bind specific CPUs to specific GPUs, but it isn't working as I expected. I can request a specific number of GPUs, but the CPU assignment seems wrong.

I assume I'm missing something obvious but just can't find it. Any suggestions for how to fix this, or how to investigate the problem further, would be much appreciated.
An example srun requesting one GPU follows:

$ srun -p dgx1 --gres=gpu:1 --pty $SHELL
[node-01:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla V100-SXM2-16GB
[node-01:~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
5,45
A similar example requesting eight GPUs follows:

$ srun -p dgx1 --gres=gpu:8 --pty $SHELL
[node-01:~]$ nvidia-smi --query-gpu=index,name --format=csv
index, name
0, Tesla V100-SXM2-16GB
1, Tesla V100-SXM2-16GB
2, Tesla V100-SXM2-16GB
3, Tesla V100-SXM2-16GB
4, Tesla V100-SXM2-16GB
5, Tesla V100-SXM2-16GB
6, Tesla V100-SXM2-16GB
7, Tesla V100-SXM2-16GB
[node-01:~]$ cat /sys/fs/cgroup/cpuset/slurm/uid_*/job_*/cpuset.cpus
5,45
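For reference, whether CPUs 5 and 45 are hyperthread siblings of the same physical core can be confirmed from sysfs (a sketch, assuming the standard Linux topology files; the expected output follows from the lstopo listing quoted above):

[node-01:~]$ cat /sys/devices/system/cpu/cpu5/topology/thread_siblings_list
5,45

If they are siblings, the cpuset above is a single physical core plus its second hardware thread.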
The machines all run Ubuntu 16.04, and the Slurm version is 17.11.9-2.
The /etc/slurm/gres.conf file follows:

[node-01:~]$ less /etc/slurm/gres.conf
Name=gpu Type=V100 File=/dev/nvidia0 Cores=10-11
Name=gpu Type=V100 File=/dev/nvidia1 Cores=12-13
Name=gpu Type=V100 File=/dev/nvidia2 Cores=14-15
Name=gpu Type=V100 File=/dev/nvidia3 Cores=16-17
Name=gpu Type=V100 File=/dev/nvidia4 Cores=18-19
Name=gpu Type=V100 File=/dev/nvidia5 Cores=20-21
Name=gpu Type=V100 File=/dev/nvidia6 Cores=22-23
Name=gpu Type=V100 File=/dev/nvidia7 Cores=24-25
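For what it's worth, the full PU table that those Cores= indices are resolved against can be dumped with hwloc (a sketch, assuming lstopo is installed; the two lines shown are the ones from the lstopo output quoted above, the rest elided):

[node-01:~]$ lstopo --only pu
...
PU L#10 (P#5)
PU L#11 (P#45)
...

Following that pattern (PU L#2N and L#2N+1 on core L#N), Cores=10-11 selects both hardware threads of core 5, i.e. OS CPUs 5 and 45.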
The /etc/slurm/slurm.conf file on all machines in the cluster follows (with minor cleanup):

ClusterName=testcluster
ControlMachine=slurm-master
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/cgroup
PluginDir=/usr/lib/slurm
ReturnToService=2
Prolog=/etc/slurm/slurm.prolog
PrologSlurmctld=/etc/slurm/slurm.ctld.prolog
Epilog=/etc/slurm/slurm.epilog
EpilogSlurmctld=/etc/slurm/slurm.ctld.epilog
TaskProlog=/etc/slurm/slurm.task.prolog
TaskPlugin=task/affinity,task/cgroup
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=20
Waittime=0
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
FastSchedule=0
DebugFlags=CPU_Bind,gres
SlurmctldDebug=debug5
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
JobCompType=jobcomp/filetxt
JobCompLoc=/data/slurm/job_completions.log
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageLoc=/data/slurm/accounting_storage.log
AccountingStorageEnforce=associations,limits,qos
AccountingStorageTRES=gres/gpu,gres/gpu:V100
PreemptMode=SUSPEND,GANG
PrologFlags=Serial,Alloc
RebootProgram="/sbin/shutdown -r 3"
PreemptType=preempt/partition_prio
CacheGroups=0
DefMemPerCPU=2048
GresTypes=gpu
NodeName=node-01 State=UNKNOWN \
    Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 \
    Gres=gpu:V100:8
PartitionName=all Nodes=node-01 \
    Default=YES MaxTime=4:0:0 DefaultTime=4:0:0 State=UP
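As a sanity check on the node definition, slurmd itself can report the topology it detects (the output line below is a sketch of the expected format, not captured from this machine):

[node-01:~]$ slurmd -C
NodeName=node-01 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=...

With FastSchedule=0, scheduling is based on this detected layout rather than on the values written in slurm.conf.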
Thanks,
Randy