[slurm-users] TensorRT script runs with srun but not from a sbatch file

Robert Kudyba rkudyba at fordham.edu
Wed Apr 29 19:25:19 UTC 2020


I'm using this TensorRT tutorial
<https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS>
with MPS on Slurm 20.02 on Bright Cluster 8.2.

Here are the contents of my mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 \
  cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 \
  tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb \
  keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2

When run via sbatch I get the errors below, so perhaps there is a path
issue that does not occur when I run with srun alone:
Could not find movielens_ratings.txt in data directories:
data/samples/movielens/
data/movielens/
&&&& FAILED
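For reference, those search paths are relative, so where the sample looks depends entirely on the job's working directory, and sbatch starts the script in the directory it was submitted from. A tiny illustration of that behavior (the /tmp/demo path is invented purely for the demo):

```shell
# A relative path like data/movielens/movielens_ratings.txt resolves
# against the current working directory; /tmp/demo is a throwaway
# directory invented for this demo.
mkdir -p /tmp/demo/data/movielens
echo dummy > /tmp/demo/data/movielens/movielens_ratings.txt

# From an unrelated directory the relative path is not found:
cd /tmp && test -f data/movielens/movielens_ratings.txt && echo found || echo missing
# From the directory that contains data/, it is:
cd /tmp/demo && test -f data/movielens/movielens_ratings.txt && echo found || echo missing
```

If that is the cause, adding a cd in the sbatch script to whichever directory holds the sample's data tree, before the sample_movielens_mps line, may be enough to fix it.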

I’m trying to use srun to test this, but it always fails because it
appears to try all nodes. We only have 3 compute nodes, and as I’m
writing this node002 and node003 are in use by other users, so I just
want to use node001.

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
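On the syntax question: srun's synopsis is `srun [OPTIONS...] executable [args...]`, so everything after /home/mydir/mpsmovietest above is handed to the script as arguments rather than parsed by srun, which would explain why --nodelist seemed to be ignored and the task landed on node002. A reordered sketch (same paths as above; -Z, i.e. --no-allocate, is dropped since it is reserved for privileged users):

```shell
# srun parses options only up to the executable name; anything after
# /home/mydir/mpsmovietest is passed to the script, not to srun.
srun --nodes=1 --nodelist=node001 --gres=gpu:1 \
     --job-name=MPSMovieTest --output=mpstest.out \
     /home/mydir/mpsmovietest
```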
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   67C    P0   241W / 250W |  32167MiB / 32510MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    428996      C   python3.6                                  32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check your CUDA installation: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1

So is my syntax wrong with srun? MPS is running:

$ ps -auwx|grep mps
root     108581  0.0  0.0  12780   812 ?        Ssl  Mar23   0:54
/cm/local/apps/cuda-

When node002 is available the program runs correctly, albeit with a
warning about the log files failing to write:

srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                       29MiB  |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
  Loading requirement: hpcx/2.4.0 gcc5/5.5.0

Loading cm-ml-python3deps/3.2.3
  Loading requirement: python36

Loading tensorflow-py36-cuda10.1-gcc/1.15.2
  Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
    keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and 3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User :   0  |  Expected Item :  128  |  Predicted Item :  128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User :   1  |  Expected Item :  133  |  Predicted Item :  133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps # /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
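About the "Failed writing log files to directory [/tmp/nvidia-log]" warning in the successful run: the ps output shows the MPS control daemon running as root, and a root-started daemon leaves /tmp/nvidia-log owned by root, so the copy the job tries to start can neither write its logs nor take over (hence "An instance of this daemon is already running"). A hedged sketch for inspecting and, where appropriate, restarting it as the job user (quit is the documented nvidia-cuda-mps-control command):

```shell
# Check ownership of the MPS pipe and log directories named in the
# sbatch script:
ls -ld /tmp/nvidia-mps /tmp/nvidia-log

# If restarting the daemon as the job user is acceptable on this node:
echo quit | nvidia-cuda-mps-control      # stop the running daemon
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d               # start it as the current user
```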

Is something incorrect in the sbatch file?

Thanks!

Rob