<div dir="ltr"><div class="gmail_quote"><div>I'm using <a href="https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS" target="_blank">this TensorRT tutorial</a> with MPS on Slurm 20.02 on Bright Cluster 8.2.<br></div><div><br></div><div>Here are the contents of my mpsmovietest sbatch file:</div><font face="monospace">#!/bin/bash<br>#SBATCH --nodes=1<br>#SBATCH --job-name=MPSMovieTest<br>#SBATCH --gres=gpu:1<br>#SBATCH --nodelist=node001<br>#SBATCH --output=mpstest.out<br>export CUDA_VISIBLE_DEVICES=0<br>nvidia-smi -i 0<br>export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps<br>export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log<br>nvidia-cuda-mps-control -d<br>module load shared slurm  openmpi/cuda/64 cm-ml-python3deps/3.2.3  cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc<br>/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2</font>  <br></div><div class="gmail_quote"><br></div><div class="gmail_quote">When I submit this file through Slurm I get the errors below, so perhaps there is a path issue that does not show up when I run <font face="monospace">srun </font>alone:</div><div class="gmail_quote"><font face="monospace">Could not find movielens_ratings.txt in data directories:<br>data/samples/movielens/<br>data/movielens/<br>&&&& FAILED</font></div><div class="gmail_quote"><div dir="ltr"><p style="margin-top:0px;color:rgb(51,51,51);font-family:DIN-Web-Pro,Helvetica,Arial,sans-serif;font-size:15px;letter-spacing:0.15px"></p><p style="margin-top:0px;color:rgb(51,51,51);font-family:DIN-Web-Pro,Helvetica,Arial,sans-serif;font-size:15px;letter-spacing:0.15px">I’m trying to use <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans 
Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">srun</code> to test this but it always fails as it appears to be trying all nodes. We only have 3 compute nodes. As I’m writing this <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node002</code> and <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node003</code> are in use by other users so I just want to use <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node001</code>.</p><pre style="overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:15px;color:rgb(51,51,51);letter-spacing:0.15px"><code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;overflow:auto;background:rgb(249,249,249);display:block;padding:0.5em;max-height:500px">srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest  --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation:  <a href="http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" target="_blank">http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html</a>
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
</code></pre><p style="color:rgb(51,51,51);font-family:DIN-Web-Pro,Helvetica,Arial,sans-serif;font-size:15px;letter-spacing:0.15px">So is my syntax wrong with <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">srun</code>? MPS is running:</p><pre style="overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:15px;color:rgb(51,51,51);letter-spacing:0.15px"><code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;overflow:auto;background:rgb(249,249,249);display:block;padding:0.5em;max-height:500px">$ ps -auwx|grep mps
root     108581  0.0  0.0  12780   812 ?        Ssl  Mar23   0:54 /cm/local/apps/cuda-</code></pre><div>When node002 is available the program runs correctly, albeit with an error about the log file failing to write:</div><div><br></div><div><font face="monospace">srun /home/mydir/mpsmovietest  --gres=gpu:1 --job-name=MPSMovieTest  --nodes=1 --nodelist=node001 -Z --output=mpstest.out<br>Thu Apr 16 10:08:52 2020<br>+-----------------------------------------------------------------------------+<br>| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |<br>|-------------------------------+----------------------+----------------------+<br>| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |<br>| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |<br>|===============================+======================+======================|<br>|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |<br>| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |<br>+-------------------------------+----------------------+----------------------+<br><br>+-----------------------------------------------------------------------------+<br>| Processes:                                                       GPU Memory |<br>|  GPU       PID   Type   Process name                             Usage      |<br>|=============================================================================|<br>|    0    420596      C   nvidia-cuda-mps-server                        29MiB |<br>+-----------------------------------------------------------------------------+<br>Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.<br>An instance of this daemon is already running<br>Warning: Failed writing log files to directory [/tmp/nvidia-log]. 
No logs will be available.<br>Loading openmpi/cuda/64/3.1.4<br>  Loading requirement: hpcx/2.4.0 gcc5/5.5.0<br><br>Loading cm-ml-python3deps/3.2.3<br>  Loading requirement: python36<br><br>Loading tensorflow-py36-cuda10.1-gcc/1.15.2<br>  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20<br>    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6<br>&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2<br>[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt<br>[03/16/2020-10:08:53] [I] Begin parsing model...<br>[03/16/2020-10:08:53] [I] End parsing model...<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.<br>[03/16/2020-10:09:01] [I] Num of users : 2<br>[03/16/2020-10:09:01] [I] Num of Movies : 100<br>[03/16/2020-10:09:01] [I] | PID : 99395 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |<br>[03/16/2020-10:09:01] [I] | PID : 99395 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99396 . 
Duration : 306.944 microseconds.<br>[03/16/2020-10:09:01] [I] Num of users : 2<br>[03/16/2020-10:09:01] [I] Num of Movies : 100<br>[03/16/2020-10:09:01] [I] | PID : 99396 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |<br>[03/16/2020-10:09:01] [I] | PID : 99396 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |<br>[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.<br>&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2<br></font></div><div><br></div>Is something incorrect in the sbatch file?</div><div dir="ltr"><br></div><div>Thanks!</div><div><br></div><div>Rob</div>
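<div><br></div><div>One thing that may be worth ruling out is the option order on the srun line. As far as I understand, srun only parses options that appear before the executable and hands everything after it to the script as arguments, and it does not read the #SBATCH lines inside the script. The sketch below just prints a reordered command line so it can be inspected before running it. The helper name build_srun_cmd is hypothetical, and -Z is left out on the assumption that it is not needed for this test.</div>

```shell
#!/bin/sh
# Sketch only: print an srun command line with every option placed
# before the script path. build_srun_cmd is a hypothetical helper name.
build_srun_cmd() {
    script="$1"
    # Options mirror the #SBATCH lines in mpsmovietest; -Z is omitted (assumption).
    opts="--nodes=1 --job-name=MPSMovieTest --gres=gpu:1 --nodelist=node001 --output=mpstest.out"
    echo "srun ${opts} ${script}"
}

build_srun_cmd /home/mydir/mpsmovietest
```

<div>If the printed line looks right, it can be pasted into the shell as-is. Submitting the file with sbatch instead should pick up the #SBATCH lines directly.</div>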
</div></div>