We're running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering how the sbatch file below is sharing a GPU.

MPS is running on the head node:

ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on SO here:
https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app

Here are the contents of the sbatch file:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

From nvidia-smi on the compute node:

    Processes
        Process ID          : 320467
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB
        Process ID          : 320574
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB

[node003 ~]# nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                 : Fri Apr 3 15:27:49 2020
Driver Version            : 440.33.01
CUDA Version              : 10.2

Attached GPUs             : 1
GPU 00000000:3B:00.0
    Compute Mode          : Default

[~]# nvidia-smi
Fri Apr 3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+
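On the MPS side, this is how I was planning to check whether those two python processes are actually attached to an MPS server, by querying the control daemon. This is just my assumption of the right way to verify it; the commands have to run on the node where the control daemon itself is running, and the log path is only the documented default (CUDA_MPS_LOG_DIRECTORY may point elsewhere):

# Ask the control daemon which MPS server processes it has started
echo get_server_list | nvidia-cuda-mps-control

# For a server PID returned above, list the client PIDs attached to it
echo "get_client_list <server_pid>" | nvidia-cuda-mps-control

# Default MPS log location, unless CUDA_MPS_LOG_DIRECTORY is set
tail /var/log/nvidia-mps/control.log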
From htop:

   PID USER     PRI NI  VIRT   RES   SHR S CPU% MEM%     TIME+ Command
320574 ouruser   20  0 12.2G 1538M  412M R 502.  0.8  14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser   20  0 12.2G 1555M  412M D 390.  0.8  14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser   20  0 12.2G 1555M  412M R 111.  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser   20  0 12.2G 1555M  412M R 111.  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser   20  0 12.2G 1555M  412M R 55.8  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser   20  0 12.2G 1555M  412M R 55.8  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser   20  0 12.2G 1538M  412M R 55.8  0.8   3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
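To tie those PIDs back to the two Slurm jobs, the check I had in mind was to look at the cgroup each GPU process landed in. This is only a sketch on my part and assumes the jobs are confined by Slurm's cgroup plugins; "scontrol listpids" run on the node should give a similar PID-to-job mapping:

# Show which Slurm job cgroup each GPU process belongs to
cat /proc/320467/cgroup | grep slurm
cat /proc/320574/cgroup | grep slurm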

Is PyTorch somehow working around Slurm and NOT locking a GPU, since the user omitted --gres=gpu:1? How can I tell if MPS is really working?
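For reference, this is what I assumed a job would need for Slurm to actually allocate (and lock) the GPU. That is my assumption only; I don't yet know how GresTypes and gres.conf are set up on this cluster, so the GRES name may differ:

#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1   # request one GPU so it is tracked in the job's allocation

And this is what I was going to check on the compute node to see whether any GPU GRES has actually been allocated there:

scontrol show node node003 | grep -iE 'gres|tres'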