<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">Hi all</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">I'm quite new to Slurm, and have set up an Ubuntu box with 5 A40 GPUs.</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">Allocating one or more GPUs with --gres=gpu:1 (or --gres=gpu:2) works great!</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">However, a number of our tasks use only about 50% of the resources of one GPU. In this case,</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">we would like to submit 10 jobs with --gres=mps:50 and have them automatically allocated</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">two to each GPU.</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">But I run into exactly the same problem as Geoffrey described last year (see below): </pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">The process works great for the two jobs allocated to the first GPU, </pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">but subsequent jobs are queued instead of starting on the next GPU.</pre>
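<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">The placement expected here can be sketched as below. This is only an illustration,
assuming the configured mps Count is split evenly across the GPUs (500 / 5 = 100 per GPU)
and that each job must fit on a single GPU. The name expected_placement is hypothetical
and is not part of Slurm.</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">
```python
def expected_placement(total_mps, num_gpus, mps_per_job, num_jobs):
    """Return one GPU index per running job, filling each GPU in turn.

    Assumption (not taken from Slurm's code): the mps Count is divided
    evenly across the GPUs and a job is never split between two GPUs.
    """
    shares_per_gpu = total_mps // num_gpus        # e.g. 500 // 5 = 100
    jobs_per_gpu = shares_per_gpu // mps_per_job  # e.g. 100 // 50 = 2
    capacity = jobs_per_gpu * num_gpus            # jobs that can run at once
    running = min(num_jobs, capacity)             # any remaining jobs stay queued
    return [job // jobs_per_gpu for job in range(running)]


# Values from this post: 5 GPUs, mps Count=500, 10 jobs with --gres=mps:50
placement = expected_placement(total_mps=500, num_gpus=5, mps_per_job=50, num_jobs=10)
print(placement)
```
</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">What is observed instead is that only the first two jobs (both on GPU 0) start.</pre>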
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);"><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">I am running the Nvidia MPS server, and nvidia-smi looks ok:</span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);"><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">+-----------------------------------------------------------------------------+<div>| NVIDIA-SMI 460.91.03    Driver Version: 460.91.03    CUDA Version: 11.2     |</div><div>|-------------------------------+----------------------+----------------------+</div><div>| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</div><div>| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</div><div>|                               |                      |               MIG M. |</div><div>|===============================+======================+======================|</div><div>|   0  A40                 Off  | 00000000:25:00.0 Off |                    0 |</div><div>|  0%   28C    P8    21W / 300W |     29MiB / 45634MiB |      0%      Default |</div><div>|                               |                      |                  N/A |</div><div>+-------------------------------+----------------------+----------------------+</div><div>|   1  A40                 Off  | 00000000:81:00.0 Off |                    0 |</div><div>|  0%   28C    P8    24W / 300W |     30MiB / 45634MiB |      0%      Default |</div><div>|                               |                      |                  N/A |</div><div>+-------------------------------+----------------------+----------------------+</div><div>|   2  A40                 Off  | 00000000:A1:00.0 Off |                    0 |</div><div>|  0%   26C    P8    29W / 300W |     30MiB / 45634MiB |      0%      Default |</div><div>|                               |                      |                  N/A |</div><div>+-------------------------------+----------------------+----------------------+</div><div>|   3  A40                 Off  | 00000000:C1:00.0 Off |                    0 |</div><div>|  0%   27C    P8    31W / 300W |     30MiB / 45634MiB |      0%      Default |</div><div>|                               |                      |                  N/A |</div><div>+-------------------------------+----------------------+----------------------+</div><div>|   4  A40                 Off  | 00000000:E1:00.0 Off |                    0 |</div><div>|  0%   26C    P8    23W / 300W |     30MiB / 45634MiB |      0%      Default |</div><div>|                               |                      |                  N/A |</div><div>+-------------------------------+----------------------+----------------------+</div><div> 
</div><div>+-----------------------------------------------------------------------------+</div><div>| Processes:                                                                  |</div><div>|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |</div><div>|        ID   ID                                                   Usage      |</div><div>|=============================================================================|</div><div>|    0   N/A  N/A     36939      C   nvidia-cuda-mps-server             27MiB |</div><div>|    1   N/A  N/A     36939      C   nvidia-cuda-mps-server             27MiB |</div><div>|    2   N/A  N/A     36939      C   nvidia-cuda-mps-server             27MiB |</div><div>|    3   N/A  N/A     36939      C   nvidia-cuda-mps-server             27MiB |</div><div>|    4   N/A  N/A     36939      C   nvidia-cuda-mps-server             27MiB |</div><span>+-----------------------------------------------------------------------------+</span><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);"><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">Has anyone </span>managed to get this to work?</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">Cheers,</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">Esben</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">----------- gres.conf ---------------</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">##################################################################</span><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><div># Slurm's Generic Resource (GRES) configuration file </div><div># Define GPU devices </div><div>##################################################################</div><div>#AutoDetect=nvml </div><div>Name=gpu Type=A40 File=/dev/nvidia[0-4]</div><span>Name=mps Count=500 File=/dev/nvidia[0-4]</span><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);"><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">------------ slurm.conf ---------------</span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);"><br></span></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><span style="font-family: courier, "courier new", monospace; font-size: 14px; color: rgb(0, 0, 0);">SlurmctldHost=ai<div>NodeName=ai Boards=1 SocketsPerBoard=2 CoresPerSocket=48 ThreadsPerCore=2 Gres=gpu:A40:5,mps:500 Feature=ht,gpu,mps <br></div><div><br></div><div>PartitionName=debug Nodes=ai Default=YES MaxTime=INFINITE State=UP AllowGroups=ALL AllowAccounts=ALL</div><div><br></div><div>SlurmdUser=root</div><div>ClusterName=cluster</div><div><br></div><div>SelectType=select/cons_tres </div><div>SelectTypeParameters=CR_Core </div><div>JobAcctGatherType=jobacct_gather/cgroup</div><div><br></div><div>## GRES</div><div>GresTypes=gpu,mps</div><div>DebugFlags=CPU_Bind,gres<br></div><div><br></div><div><br></div></span></pre>
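<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">A quick consistency check between the two files above might look like this. It is a
rough sketch only: parse_gres is a hypothetical helper, not a Slurm tool, and it assumes
plain integer counts in entries of the form name:count or name:type:count.</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">
```python
def parse_gres(gres):
    """Map each GRES name in a slurm.conf Gres= value to its count.

    Assumes entries look like name:count or name:type:count with plain
    integer counts; the optional type field is ignored.
    """
    counts = {}
    for entry in gres.split(","):
        fields = entry.split(":")
        counts[fields[0]] = int(fields[-1])
    return counts


# Gres= value copied from the NodeName line in slurm.conf above
node_gres = parse_gres("gpu:A40:5,mps:500")
gres_conf_mps_count = 500  # Count= value from the mps line in gres.conf above

print(node_gres)
print(node_gres["mps"] == gres_conf_mps_count)
```
</pre>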
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">> --------------------------</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">> Ransom, Geoffrey M. Thu, 09 Jan 2020 10:53:10 -0800<br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)"><br></pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">BLUF:
Is the Nvidia MPS service required for the MPS gres to function in slurm
with multiple GPUs in a single machine? (jobs using MPS don't need to span
GPUs, just use a part of a GPU in a machine with multiple GPUs)
Is there more detailed documentation available on how MPS should be set up
and how it functions?</pre>
<pre style="font-family:courier, "courier new", monospace;font-size:14px;overflow-wrap:break-word;margin:0px;background-color:rgb(255, 255, 255)">I'm playing with mps on a test machine and the documentation at
<a rel="nofollow" href="https://slurm.schedmd.com/gres.html" style="color:rgb(160, 30, 30)">https://slurm.schedmd.com/gres.html</a> seems a bit vague. It implies it can be
used across multiple GPUs, but then states that only one GPU per node may be
configured for use with MPS.
When I test mps in slurm without the NVIDIA MPS service (I am just starting to
read up on the NVIDIA MPS service now) it does seem to only use one GPU.
In gres.conf
NodeName=testmachine1 Name=gpu File=/dev/nvidia[0-1]
NodeName=testmachine1 Name=mps count=200 File=/dev/nvidia[0-1]
In slurm.conf
NodeName=testmachine1 Gres=gpu:2,mps:200 Sockets=1 CoresPerSocket=6
An array job posted with "--gres=mps:50" will put two job steps on the first
GPU, but doesn't use the second GPU for mps jobs.
Is the Nvidia MPS service required for the MPS gres to function in slurm?
Is there more detailed documentation available on how MPS should be set up and
how it functions?
We have a mixed set of work (shared GPU using 1 CPU core and a small percentage
of one GPU versus dedicated GPU jobs using a whole number of GPUs and CPUs) on
machines with 4 GPUs and it would be nice to have them co-exist instead of
splitting the machines into two separate partitions for the two styles of jobs.
Thanks.</pre>
<br>
</div>
</body>
</html>