[slurm-users] srun always uses node002 even using --nodelist=node001
Robert Kudyba
rkudyba at fordham.edu
Thu Apr 16 14:20:28 UTC 2020
I'm using this TensorRT tutorial
<https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS>
with MPS on Slurm 20.02 on Bright Cluster 8.2.
I'm trying to use srun to test this, but it always fails because it appears
to be trying all nodes. We only have 3 compute nodes. As I'm writing this,
node002 and node003 are in use by other users, so I just want to use node001.
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 67C P0 241W / 250W | 32167MiB / 32510MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 428996 C python3.6 32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
So is my syntax wrong with srun? MPS is running:
$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54
/cm/local/apps/cuda-
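One thing worth noting about the command above: srun parses its own options only *before* the executable (the usage is `srun [OPTIONS...] executable [args...]`), so everything after /home/mydir/mpsmovietest — including --nodelist=node001 — is passed as arguments to the script rather than to srun, which would explain why Slurm ignores the node list. A sketch of the reordered command, reusing the same paths and dropping -Z (--no-allocate, which as I understand it is restricted to root):

```shell
# Options must come before the executable for srun to see them;
# anything after the program name is handed to the program itself.
srun --gres=gpu:1 --job-name=MPSMovieTest \
     --nodes=1 --nodelist=node001 \
     --output=mpstest.out \
     /home/mydir/mpsmovietest
```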
When node002 is available the program runs correctly, albeit with warnings
in the log file:
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                      29MiB   |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and
3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run
Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
Here are the contents of the mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3
cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3
tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb
keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
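On the "An instance of this daemon is already running" warning: the sbatch file launches nvidia-cuda-mps-control -d on every run, but the ps output above shows a control daemon already started (by root), so the second launch fails and the non-root job can't write to /tmp/nvidia-log. A hedged sketch of a guarded start, assuming the same pipe and log directories (get_server_list is a standard nvidia-cuda-mps-control stdin command; the client exits nonzero when no daemon is serving the pipe directory):

```shell
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Only start the MPS control daemon if one is not already serving this
# pipe directory, to avoid the duplicate-daemon warning seen in the log.
if ! echo get_server_list | nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d
fi
```

If the daemon really should be per-job, the pipe/log directories would need to be user-writable (e.g. per-user paths) rather than the shared /tmp locations a root-started daemon already owns.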