[slurm-users] PyTorch with Slurm and MPS work-around --gres=gpu:1?
Robert Kudyba
rkudyba at fordham.edu
Fri Apr 3 19:45:06 UTC 2020
Running Slurm 20.02 on CentOS 7.7 with Bright Cluster 8.2. I'm wondering
how the sbatch file below ends up sharing a GPU.
MPS is running on the head node:
ps -auwx|grep mps
root 108581 0.0 0.0 12780 812 ? Ssl Mar23 0:27
/cm/local/apps/cuda-driver/libs/440.33.01/bin/nvidia-cuda-mps-control -d
The entire script is posted on SO here:
<https://stackoverflow.com/questions/34709749/how-do-i-use-nvidia-multi-process-service-mps-to-run-multiple-non-mpi-cuda-app>
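One way to check whether MPS is actually brokering any CUDA work is to query the
control daemon directly. A minimal sketch, assuming the daemon uses the default
pipe directory and you run this on the node whose MPS you want to inspect
(<server_pid> is a placeholder for whatever PID get_server_list prints):

echo get_server_list | nvidia-cuda-mps-control                 # PIDs of any running MPS servers
echo get_client_list <server_pid> | nvidia-cuda-mps-control    # clients attached to that server

If get_server_list prints nothing, no MPS server has been started and the CUDA
processes are talking to the GPU directly.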
Here is the sbatch file contents:
#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003
module purge
module load gcc5 cuda10.1
module load openmpi/cuda/64
module load pytorch-py36-cuda10.1-gcc
module load ml-pythondeps-py36-cuda10.1-gcc
python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof $1 | tee alex_100_imwoof_seq_longtrain_cv_$1.txt
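For comparison, here is a sketch of the same header with the GPU requested
explicitly, so that Slurm's GRES accounting tracks the device (this assumes a
gres/gpu entry is configured for node003 in gres.conf/slurm.conf):

#!/bin/sh
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --gres=gpu:1
#SBATCH --job-name=sequentialBlur_alexnet_training_imagewoof_crossval
#SBATCH --nodelist=node003

With the GRES request present, and ConstrainDevices=yes in cgroup.conf, a job
that did not ask for a GPU would not be able to open the device at all.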
From nvidia-smi on the compute node:
Processes
Process ID : 320467
Type : C
Name : python3.6
Used GPU Memory : 2369 MiB
Process ID : 320574
Type : C
Name : python3.6
Used GPU Memory : 2369 MiB
[node003 ~]# nvidia-smi -q -d compute
==============NVSMI LOG==============
Timestamp : Fri Apr 3 15:27:49 2020
Driver Version : 440.33.01
CUDA Version : 10.2
Attached GPUs : 1
GPU 00000000:3B:00.0
Compute Mode : Default
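Compute Mode is Default here, which lets any process create its own context on
the GPU; MPS deployments normally pair the daemon with EXCLUSIVE_PROCESS mode so
that only the MPS server can attach. A sketch of how that is usually set
(assumes root on node003 and GPU index 0):

nvidia-smi -i 0 -c EXCLUSIVE_PROCESS    # force CUDA clients to go through the MPS server
nvidia-smi -q -d compute                # re-check the reported Compute Mode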
[~]# nvidia-smi
Fri Apr 3 15:28:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   42C    P0    46W / 250W |   4750MiB / 32510MiB |     32%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    320467      C   python3.6                                   2369MiB |
|    0    320574      C   python3.6                                   2369MiB |
+-----------------------------------------------------------------------------+
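If MPS were actually in the path on node003, an nvidia-cuda-mps-server process
would normally be running there, and on Volta parts the client entries in
nvidia-smi are typically reported with type M+C rather than plain C. A quick
check (assumes pgrep is available on the node):

pgrep -af nvidia-cuda-mps-server || echo "no MPS server process on this node"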
From htop:
320574 ouruser 20 0 12.2G 1538M 412M R 502. 0.8 14h45:59 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320467 ouruser 20 0 12.2G 1555M 412M D 390. 0.8 14h45:13 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320654 ouruser 20 0 12.2G 1555M 412M R 111. 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320656 ouruser 20 0 12.2G 1555M 412M R 111. 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320658 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320660 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320661 ouruser 20 0 12.2G 1538M 412M R 111. 0.8 3h00:54 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
320655 ouruser 20 0 12.2G 1555M 412M R 55.8 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320657 ouruser 20 0 12.2G 1555M 412M R 55.8 0.8 3h00:56 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 0
320659 ouruser 20 0 12.2G 1538M 412M R 55.8 0.8 3h00:53 python3.6 SequentialBlur_untrained.py alexnet 100 imagewoof 1
Is PyTorch somehow working around Slurm and NOT locking a GPU since the
user omitted --gres=gpu:1? How can I tell if MPS is really working?
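One way to answer the first part is to look at what Slurm actually allocated and
what the job environment sees; a sketch, with 12345 standing in for the real job ID:

scontrol show job 12345 | grep -i -E 'gres|tres'    # a GPU shows up here only if it was requested
srun --jobid=12345 bash -c 'echo CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-unset}'

If no gres/gpu appears and CUDA_VISIBLE_DEVICES is unset, Slurm never allocated
the GPU to the job, and unless ConstrainDevices=yes is set in cgroup.conf the
processes can still open the device directly, with or without MPS.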