<div dir="ltr"><div class="gmail_quote"><div>I'm using <a href="https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS" target="_blank">this TensorRT tutorial</a> with MPS on Slurm 20.02 on Bright Cluster 8.2.<br></div><div><br></div><div>Here are the contents of my mpsmovietest sbatch file:</div><font face="monospace">#!/bin/bash<br>#SBATCH --nodes=1<br>#SBATCH --job-name=MPSMovieTest<br>#SBATCH --gres=gpu:1<br>#SBATCH --nodelist=node001<br>#SBATCH --output=mpstest.out<br>export CUDA_VISIBLE_DEVICES=0<br>nvidia-smi -i 0<br>export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps<br>export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log<br>nvidia-cuda-mps-control -d<br>module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc<br>/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2</font> <br></div><div class="gmail_quote"><br></div><div class="gmail_quote">When I run this in Slurm I get the errors below, so perhaps there is a path issue that does not arise when I run <font face="monospace">srun</font> alone:</div><div class="gmail_quote"><font face="monospace">Could not find movielens_ratings.txt in data directories:<br>data/samples/movielens/<br>data/movielens/<br>&&&& FAILED</font></div><div class="gmail_quote"><div dir="ltr"><p style="margin-top:0px;color:rgb(51,51,51);font-family:DIN-Web-Pro,Helvetica,Arial,sans-serif;font-size:15px;letter-spacing:0.15px">I’m trying to use <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans 
Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">srun</code> to test this, but it always fails because it appears to be trying all nodes. We only have 3 compute nodes. As I’m writing this, <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node002</code> and <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node003</code> are in use by other users, so I just want to use <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">node001</code>.</p><pre style="overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:15px;color:rgb(51,51,51);letter-spacing:0.15px"><code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;overflow:auto;background:rgb(249,249,249);display:block;padding:0.5em;max-height:500px">srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 67C P0 241W / 250W | 32167MiB / 32510MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 428996 C python3.6 32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation: <a href="http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html" target="_blank">http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html</a>
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
</code></pre><p style="color:rgb(51,51,51);font-family:DIN-Web-Pro,Helvetica,Arial,sans-serif;font-size:15px;letter-spacing:0.15px">So is my syntax wrong with <code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;background:rgb(248,248,248)">srun</code>? MPS is running:</p><pre style="overflow:auto;font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:15px;color:rgb(51,51,51);letter-spacing:0.15px"><code style="font-family:Consolas,Menlo,Monaco,"Lucida Console","Liberation Mono","DejaVu Sans Mono","Bitstream Vera Sans Mono","Courier New",monospace;font-size:1em;overflow:auto;background:rgb(249,249,249);display:block;padding:0.5em;max-height:500px">$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54 /cm/local/apps/cuda-</code></pre><div>When node002 is available the program runs correctly, albeit with an error about the log file failing to write:</div><div><br></div><div><font face="monospace">srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out<br>Thu Apr 16 10:08:52 2020<br>+-----------------------------------------------------------------------------+<br>| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |<br>|-------------------------------+----------------------+----------------------+<br>| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |<br>| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |<br>|===============================+======================+======================|<br>| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |<br>| N/A 28C P0 25W / 250W | 41MiB / 32510MiB | 0% E. Process |<br>+-------------------------------+----------------------+----------------------+<br><br>+-----------------------------------------------------------------------------+<br>| Processes: GPU Memory |<br>| GPU PID Type Process name Usage |<br>|=============================================================================|<br>| 0 420596 C nvidia-cuda-mps-server 29MiB |<br>+-----------------------------------------------------------------------------+<br>Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs will be available.<br>An instance of this daemon is already running<br>Warning: Failed writing log files to directory [/tmp/nvidia-log]. 
No logs will be available.<br>Loading openmpi/cuda/64/3.1.4<br> Loading requirement: hpcx/2.4.0 gcc5/5.5.0<br><br>Loading cm-ml-python3deps/3.2.3<br> Loading requirement: python36<br><br>Loading tensorflow-py36-cuda10.1-gcc/1.15.2<br> Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20<br> keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6<br>&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2<br>[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt<br>[03/16/2020-10:08:53] [I] Begin parsing model...<br>[03/16/2020-10:08:53] [I] End parsing model...<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.<br>[03/16/2020-10:09:01] [I] Num of users : 2<br>[03/16/2020-10:09:01] [I] Num of Movies : 100<br>[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item : 128 | Predicted Item : 128 |<br>[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item : 133 | Predicted Item : 133 |<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5<br>[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99396 . 
Duration : 306.944 microseconds.<br>[03/16/2020-10:09:01] [I] Num of users : 2<br>[03/16/2020-10:09:01] [I] Num of Movies : 100<br>[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item : 128 | Predicted Item : 128 |<br>[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item : 133 | Predicted Item : 133 |<br>[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.<br>&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2<br></font></div><div><br></div>Is something incorrect in the sbatch file?</div><div dir="ltr"><br></div><div>Thanks!</div><div><br></div><div>Rob</div>
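<div><br></div><div>P.S. One thing I have not ruled out yet: as far as I understand, <font face="monospace">srun</font> stops parsing its own options at the executable name and passes everything after it to the executable as arguments. In my command above the options come after <font face="monospace">/home/mydir/mpsmovietest</font>, so <font face="monospace">--nodelist=node001</font> may never have reached <font face="monospace">srun</font>, which would be consistent with the task landing on node002. Below is a minimal sketch of the reordered call, using the same options and script path as above. The <font face="monospace">DRY_RUN</font> switch is my own addition for illustration (it is not a Slurm feature), and I have not run this on the cluster yet:</div>

```shell
#!/bin/bash
# Sketch only: same options as in the srun command above, but placed BEFORE
# the script, because srun hands everything after the executable name to the
# executable as its own arguments.
script=/home/mydir/mpsmovietest
srun_opts="--gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out"
cmd="srun $srun_opts $script"

# DRY_RUN is a hypothetical switch for this sketch: by default the command is
# only printed so the option order can be checked before anything is launched.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$cmd"
else
    # Left unquoted on purpose so the option string splits into separate words.
    $cmd
fi
```

<div>Running it as is only prints the command; setting <font face="monospace">DRY_RUN=0</font> would actually launch it.</div>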
</div></div>