[slurm-users] TensorRT script runs with srun but not from a sbatch file
Robert Kudyba
rkudyba at fordham.edu
Wed Apr 29 19:25:19 UTC 2020
I'm using this TensorRT tutorial
<https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS>
with MPS on Slurm 20.02 on Bright Cluster 8.2.
Here are the contents of my mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3 \
    cudnn/7.0 cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3 \
    tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb \
    keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2 -p 2
When submitted through sbatch I get the errors below, so perhaps there is a
pathing issue that doesn't occur when I run the command with srun alone:
Could not find movielens_ratings.txt in data directories:
data/samples/movielens/
data/movielens/
&&&& FAILED
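The error lists relative paths, so I suspect the sample resolves its data
directory against the job's current working directory, and sbatch starts the
job in whatever directory I submitted from, while my interactive tests
happened to start in the right place. If that's the cause, something like
this in the sbatch file might fix it (a sketch only; the path is my guess at
where the sample data lives under our TensorRT install):

#SBATCH --chdir=/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5

or, equivalently, a cd in the script body before launching the sample:

cd /cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5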
I'm trying to test this with srun, but it always fails, and it appears to try
every node. We have only 3 compute nodes; as I write this, node002 and
node003 are in use by other users, so I want to run only on node001.
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 67C P0 241W / 250W | 32167MiB / 32510MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 428996 C python3.6 32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
So is my syntax wrong with srun? The error line shows the task actually ran
on node002, even though I asked for node001, and the nvidia-smi output above
shows that GPU in E. Process (exclusive) compute mode, already at 100% under
another user's python3.6 job, which would explain the CUDA initialization
failure (error 999).
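My guess is that because the options come after the executable, srun passes
them to my script as arguments instead of consuming them itself, leaving srun
free to pick any node. If so, the fix may just be reordering them, something
like this (same options moved in front of the executable; I've also dropped
-Z, which I believe is --no-allocate, a privileged option I don't think I
want):

srun --nodes=1 --nodelist=node001 --gres=gpu:1 --job-name=MPSMovieTest \
     --output=mpstest.out /home/mydir/mpsmovietest

In any case, MPS is running: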
$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54
/cm/local/apps/cuda-
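That daemon has been up since March 23 and is owned by root, so the control
daemon my job script starts is probably not the one actually in use. If I
wanted to confirm which pipe/log directories the root daemon was started
with, I think something like this would show it (run as root, since the
process isn't mine; the PID comes from the ps output above):

sudo tr '\0' '\n' < /proc/108581/environ | grep CUDA_MPS
ls -ld /tmp/nvidia-mps /tmp/nvidia-log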
When node002 is available the program runs correctly, albeit with warnings
about failing to write the MPS log files:
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest --nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 28C P0 25W / 250W | 41MiB / 32510MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 420596 C nvidia-cuda-mps-server 29MiB |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and
3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run
Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
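About those "Failed writing log files" warnings: since the first MPS control
daemon was started by root, I assume /tmp/nvidia-log is owned by root and my
user simply can't write there, and "An instance of this daemon is already
running" means my script's attempt to start a second daemon is ignored. If
so, per-user directories might avoid the collision; a hypothetical sketch for
the sbatch script (the directory names are my own invention):

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps-$USER
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log-$USER
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d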
Is something incorrect in the sbatch file?
Thanks!
Rob