[slurm-users] MPS Count option clarification and TensorFlow 2/PyTorch greediness causing out of memory OOMs

Robert Kudyba rkudyba at fordham.edu
Wed Aug 26 00:49:22 UTC 2020

Comparing with the Slurm MPS configuration example here
<https://slurm.schedmd.com/gres.html#MPS_config_example_2>, our gres.conf
has this:
NodeName=node[001-003] Name=mps Count=400

What does "Count" really mean and how do you use this number?

From that web page <https://slurm.schedmd.com/gres.html#MPS_Management>:
"MPS configuration includes only the Name and Count parameters: The count
of gres/mps elements will be evenly distributed across all GPUs configured
on the node. This is similar to case 1, but places duplicate configuration
in the gres.conf file."
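If I'm reading that "evenly distributed" rule correctly, then for a node like
the one in Example 1 below (Count=400 on a node with four GPUs) the split
works out as follows. This is just my interpretation, not confirmed by
SchedMD:

```python
# My reading of the "evenly distributed" rule from gres.html, using
# Example 1 below (Count=400, four GPUs). Not confirmed by SchedMD.
total_count = 400
num_gpus = 4
per_gpu = total_count // num_gpus
print(per_gpu)  # 100 gres/mps elements per GPU

# A job asking for --gres=mps:N would then, as I understand it, get
# N / per_gpu of one GPU, i.e. N percent when per_gpu == 100.
requested = 35
print(100 * requested // per_gpu)  # 35 (percent of one GPU)
```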

Also on that page there is this:
# Example 1 of gres.conf
# Configure support for four GPUs (with MPS)
Name=gpu Type=gp100 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gp100 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=p6000 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=p6000 File=/dev/nvidia3 Cores=2,3
# Set gres/mps Count value to 100 on each of the 4 available GPUs
Name=mps Count=400

And then this (side note: "differernt" is a typo in the example):

# Example 2 of gres.conf
# Configure support for four *differernt* GPU types (with MPS)
Name=gpu Type=gtx1080 File=/dev/nvidia0 Cores=0,1
Name=gpu Type=gtx1070 File=/dev/nvidia1 Cores=0,1
Name=gpu Type=gtx1060 File=/dev/nvidia2 Cores=2,3
Name=gpu Type=gtx1050 File=/dev/nvidia3 Cores=2,3
Name=mps Count=1300   File=/dev/nvidia0
Name=mps Count=1200   File=/dev/nvidia1
Name=mps Count=1100   File=/dev/nvidia2
Name=mps Count=1000   File=/dev/nvidia3

And lower on the page, not sure what "to a job of step" means:
"The percentage will be calculated based upon the portion of the configured
Count on the Gres is allocated to a job of step. For example, a job
requesting "--gres=gpu:200" and using configuration example 2 above would
be allocated
15% of the gtx1080 (File=/dev/nvidia0, 200 x 100 / 1300 = 15), or
16% of the gtx1070 (File=/dev/nvidia1, 200 x 100 / 1200 = 16), or
18% of the gtx1060 (File=/dev/nvidia2, 200 x 100 / 1100 = 18), or
20% of the gtx1050 (File=/dev/nvidia3, 200 x 100 / 1000 = 20)."
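The arithmetic behind those percentages appears to be requested-count x 100 /
configured-Count, truncated to an integer, which is easy to check:

```python
# Percent of each GPU a request for 200 gres/mps would get, using the
# per-device Count values from Example 2 (integer-truncated, matching
# the numbers quoted in the docs)
counts = {"gtx1080": 1300, "gtx1070": 1200, "gtx1060": 1100, "gtx1050": 1000}
requested = 200
percents = {gpu: requested * 100 // count for gpu, count in counts.items()}
print(percents)  # {'gtx1080': 15, 'gtx1070': 16, 'gtx1060': 18, 'gtx1050': 20}
```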

How were the count values of 1300, 1200, 1100 and 1000 determined?

Now segueing to TensorFlow 2 and PyTorch memory greediness.

Using the same "Deep Convolutional Generative Adversarial Networks"
sample script, in my sbatch file I added:

#SBATCH --gres=mps:35
echo here is the CUDA-MPS-ActiveThread-Percentage $CUDA_MPS_ACTIVE_THREAD_PERCENTAGE

So the job log file showed this:
here is value of TF_FORCE_GPU_ALLOW_GROWTH true
here is the CUDA-MPS-ActiveThread-Percentage 17

So that 17 is about half of the 35 I requested with the MPS option. The
description from the SchedMD page reads:
"The percentage will be calculated based upon the portion of the configured
Count on the Gres is allocated to a job of step."

So how does Count=400 from the gres.conf file factor in? Does it mean the
job is using 17% of the available threads of the GPU? From nvidia-smi on
this Slurm job:
| Processes:                                                   GPU Memory |
|  GPU       PID   Type   Process name                              Usage |
|    0     59793      C   python3.6                                1135MiB |

The GPU has 32 GB:

|   0  Tesla V100-PCIE...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   49C    P0   128W / 250W |   3417MiB / 32510MiB |     96%      Default |

So MPS and the Count option do not help with GPU memory, and I'm trying to
find ways to tell our users how to avoid the OOMs. The most common advice
is to use smaller batches
<https://stackoverflow.com/questions/37736071/tensorflow-out-of-memory>, but
the complaint we get is that doing so really slows down their jobs.

So I just found the "2 Physical GPUs, 2 Logical GPUs" section of the
TensorFlow 2 docs
<https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth>, which
works by setting a hard limit, in this case 2048 MB, by adding the code
below after "import tensorflow as tf":

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 2048 MB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)
I know this is outside the scope of Slurm, but I was hoping someone had a
more graceful way than a hard memory limit to achieve this. The first
option mentioned in the TF docs is to turn on memory growth by calling
tf.config.experimental.set_memory_growth, "which attempts to allocate only
as much GPU memory as needed for the runtime allocations: it starts out
allocating very little memory, and as the program gets run and more GPU
memory is needed, we extend the GPU memory region allocated to the
TensorFlow process. Note we do not release memory, since it can lead to
memory fragmentation." I've found that using the Recurrent Neural Network
Example it jumps up to 30 GB:

 I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 30486
MB memory) -> physical GPU (device: 0, name: Tesla V100-PCIE-32GB, pci bus
id: 0000:3b:00.0, compute capability: 7.0)
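For comparison, the memory-growth option from that same TF guide looks like
the sketch below (the TF_FORCE_GPU_ALLOW_GROWTH=true environment variable
visible in our job log should have the same effect). The import guard and
status bookkeeping are mine, not from the docs:

```python
# Sketch of the memory-growth option from the TF GPU guide (assumes TF 2.x);
# guarded so it degrades gracefully where TensorFlow is not installed.
try:
    import tensorflow as tf
except ImportError:
    tf = None

status = "tensorflow not installed"
if tf is not None:
    gpus = tf.config.experimental.list_physical_devices('GPU')
    try:
        for gpu in gpus:
            # Grow allocations on demand instead of grabbing nearly all
            # GPU memory up front (memory is still never released).
            tf.config.experimental.set_memory_growth(gpu, True)
        status = "memory growth enabled for %d GPU(s)" % len(gpus)
    except RuntimeError as e:
        # Memory growth must be set before any GPUs have been initialized
        status = str(e)
print(status)
```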

But at least we have a way to deal with our users as we have many TF and
PyTorch CNN jobs.