[slurm-users] srun always uses node002 even using --nodelist=node001
Robert Kudyba
rkudyba at fordham.edu
Thu Apr 16 14:20:28 UTC 2020
I'm using this TensorRT tutorial
<https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/sampleMovieLensMPS>
with MPS on Slurm 20.02 on Bright Cluster 8.2.
I'm trying to use srun to test this, but it always fails because it appears
to be trying all nodes. We only have 3 compute nodes. As I'm writing this,
node002 and node003 are in use by other users, so I just want to use node001.
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Tue Apr 14 16:45:10 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:3B:00.0 Off | 0 |
| N/A 67C P0 241W / 250W | 32167MiB / 32510MiB | 100% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 428996 C python3.6 32151MiB |
+-----------------------------------------------------------------------------+
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0 nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
[03/14/2020-16:45:10] [I] ../../../data/movielens/movielens_ratings.txt
[E] [TRT] CUDA initialization failure with error 999. Please check
your CUDA installation:
http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
[E] Could not create builder.
[03/14/2020-16:45:10] [03/14/2020-16:45:10] &&&& FAILED
TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps
-b 2 -p 2
srun: error: node002: task 0: Exited with exit code 1
So is my syntax wrong with srun? MPS is running:
$ ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:54
/cm/local/apps/cuda-
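One thing worth noting about the command above: srun parses its own options only *before* the executable (the usage is `srun [OPTIONS...] executable [args...]`), so everything after /home/mydir/mpsmovietest — including --nodelist=node001 — is passed as arguments to the script rather than to srun, which would explain why Slurm ignores the node list. A sketch of the reordered command, reusing the same paths and dropping -Z (--no-allocate, which as I understand it is restricted to root):

```shell
# Options must come before the executable for srun to see them;
# anything after the program name is handed to the program itself.
srun --gres=gpu:1 --job-name=MPSMovieTest \
     --nodes=1 --nodelist=node001 \
     --output=mpstest.out \
     /home/mydir/mpsmovietest
```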
When node002 is available the program runs correctly, albeit with warnings
in the log file:
srun /home/mydir/mpsmovietest --gres=gpu:1 --job-name=MPSMovieTest
--nodes=1 --nodelist=node001 -Z --output=mpstest.out
Thu Apr 16 10:08:52 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   28C    P0    25W / 250W |     41MiB / 32510MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    420596      C   nvidia-cuda-mps-server                      29MiB   |
+-----------------------------------------------------------------------------+
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
An instance of this daemon is already running
Warning: Failed writing log files to directory [/tmp/nvidia-log]. No logs
will be available.
Loading openmpi/cuda/64/3.1.4
Loading requirement: hpcx/2.4.0 gcc5/5.5.0
Loading cm-ml-python3deps/3.2.3
Loading requirement: python36
Loading tensorflow-py36-cuda10.1-gcc/1.15.2
Loading requirement: openblas/dynamic/0.2.20 hdf5_18/1.8.20
keras-py36-cuda10.1-gcc/2.3.1 protobuf3-gcc/3.8.0
nccl2-cuda10.1-gcc/2.5.6
&&&& RUNNING TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
[03/16/2020-10:08:52] [I] ../../../data/movielens/movielens_ratings.txt
[03/16/2020-10:08:53] [I] Begin parsing model...
[03/16/2020-10:08:53] [I] End parsing model...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] [TRT] Detected 2 inputs and
3 output network tensors.
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:08:57] [03/16/2020-10:08:57] [I] End building engine...
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99395 . Duration : 315.744 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99395 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[W] [TRT] TensorRT was linked against cuDNN 7.6.3 but loaded cuDNN 7.0.5
[03/16/2020-10:09:01] [03/16/2020-10:09:01] [03/16/2020-10:09:01] [I] Done
execution in process: 99396 . Duration : 306.944 microseconds.
[03/16/2020-10:09:01] [I] Num of users : 2
[03/16/2020-10:09:01] [I] Num of Movies : 100
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 0 | Expected Item :
128 | Predicted Item : 128 |
[03/16/2020-10:09:01] [I] | PID : 99396 | User : 1 | Expected Item :
133 | Predicted Item : 133 |
[03/16/2020-10:09:02] [I] Number of processes executed : 2. Total MPS Run
Duration : 4361.73 milliseconds.
&&&& PASSED TensorRT.sample_movielens_mps #
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
Here are the contents of the mpsmovietest sbatch file:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --job-name=MPSMovieTest
#SBATCH --gres=gpu:1
#SBATCH --nodelist=node001
#SBATCH --output=mpstest.out
export CUDA_VISIBLE_DEVICES=0
nvidia-smi -i 0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
nvidia-cuda-mps-control -d
module load shared slurm openmpi/cuda/64 cm-ml-python3deps/3.2.3
cudnn/7.0 slurm cuda10.1/toolkit ml-pythondeps-py36-cuda10.1-gcc/3.2.3
tensorflow-py36-cuda10.1-gcc tensorrt-cuda10.1-gcc/6.0.1.5 gcc gdb
keras-py36-cuda10.1-gcc nccl2-cuda10.1-gcc
/cm/shared/apps/tensorrt-cuda10.1-gcc/6.0.1.5/bin/sample_movielens_mps -b 2
-p 2
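On the "An instance of this daemon is already running" warning: the sbatch file launches nvidia-cuda-mps-control -d on every run, but the ps output above shows a control daemon already started (by root), so the second launch fails and the non-root job can't write to /tmp/nvidia-log. A hedged sketch of a guarded start, assuming the same pipe and log directories (get_server_list is a standard nvidia-cuda-mps-control stdin command; the client exits nonzero when no daemon is serving the pipe directory):

```shell
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log

# Only start the MPS control daemon if one is not already serving this
# pipe directory, to avoid the duplicate-daemon warning seen in the log.
if ! echo get_server_list | nvidia-cuda-mps-control >/dev/null 2>&1; then
    nvidia-cuda-mps-control -d
fi
```

If the daemon really should be per-job, the pipe/log directories would need to be user-writable (e.g. per-user paths) rather than the shared /tmp locations a root-started daemon already owns.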