We're running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering how the sbatch file below is sharing a GPU.

MPS is running on the head node:

ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27 /cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d

The entire script is posted on SO here:
https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app

Here are the contents of the sbatch file:

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt

From nvidia-smi on the compute node:

    Processes
        Process ID          : 320467
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB
        Process ID          : 320574
            Type            : C
            Name            : python3.6
            Used GPU Memory : 2369 MiB

[node003 ~]# nvidia-smi -q -d compute

==============NVSMI LOG==============

Timestamp                 : Fri Apr 3 15:27:49 2020
Driver Version            : 440.33.01
CUDA Version              : 10.2

Attached GPUs             : 1
GPU 00000000:3B:00.0
    Compute Mode          : Default

[~]# nvidia-smi
Fri Apr 3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+
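On the MPS side, this is how I was planning to check whether those two python processes are actually attached to an MPS server, by querying the control daemon. This is just my assumption of the right way to verify it; the commands have to run on the node where the control daemon itself is running, and the log path is only the documented default (CUDA_MPS_LOG_DIRECTORY may point elsewhere):

# Ask the control daemon which MPS server processes it has started
echo get_server_list | nvidia-cuda-mps-control

# For a server PID returned above, list the client PIDs attached to it
echo "get_client_list <server_pid>" | nvidia-cuda-mps-control

# Default MPS log location, unless CUDA_MPS_LOG_DIRECTORY is set
tail /var/log/nvidia-mps/control.log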
From htop:

   PID USER     PRI NI  VIRT   RES   SHR S CPU% MEM%     TIME+ Command
320574 ouruser   20  0 12.2G 1538M  412M R 502.  0.8  14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser   20  0 12.2G 1555M  412M D 390.  0.8  14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser   20  0 12.2G 1555M  412M R 111.  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser   20  0 12.2G 1555M  412M R 111.  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser   20  0 12.2G 1538M  412M R 111.  0.8   3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser   20  0 12.2G 1555M  412M R 55.8  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser   20  0 12.2G 1555M  412M R 55.8  0.8   3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser   20  0 12.2G 1538M  412M R 55.8  0.8   3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
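To tie those PIDs back to the two Slurm jobs, the check I had in mind was to look at the cgroup each GPU process landed in. This is only a sketch on my part and assumes the jobs are confined by Slurm's cgroup plugins; "scontrol listpids" run on the node should give a similar PID-to-job mapping:

# Show which Slurm job cgroup each GPU process belongs to
cat /proc/320467/cgroup | grep slurm
cat /proc/320574/cgroup | grep slurm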

Is PyTorch somehow working around Slurm and NOT locking a GPU, since the user omitted --gres=gpu:1? How can I tell if MPS is really working?
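For reference, this is what I assumed a job would need for Slurm to actually allocate (and lock) the GPU. That is my assumption only; I don't yet know how GresTypes and gres.conf are set up on this cluster, so the GRES name may differ:

#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1   # request one GPU so it is tracked in the job's allocation

And this is what I was going to check on the compute node to see whether any GPU GRES has actually been allocated there:

scontrol show node node003 | grep -iE 'gres|tres'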