[slurm-users] gres:mps question

Thu Jan 9 18:50:51 UTC 2020

BLUF:
     Is the NVIDIA MPS service required for the MPS gres to function in Slurm on a machine with multiple GPUs? (Jobs using MPS don't need to span GPUs; each one just uses part of a single GPU in a machine that has several.)
     Is there more detailed documentation available on how MPS should be set up and how it functions?

I'm playing with mps on a test machine and the documentation at https://slurm.schedmd.com/gres.html seems a bit vague. It implies it can be used across multiple GPUs, but then states that only one GPU per node may be configured for use with MPS.

When I test mps in Slurm without the NVIDIA MPS service running (I am only now starting to read up on it), it does seem to use only one GPU.

In gres.conf
     NodeName=testmachine1 Name=gpu File=/dev/nvidia[0-1]
     NodeName=testmachine1 Name=mps Count=200 File=/dev/nvidia[0-1]

In slurm.conf
     NodeName=testmachine1 Gres=gpu:2,mps:200 Sockets=1 CoresPerSocket=6

An array job submitted with "--gres=mps:50" puts two of the array's tasks on the first GPU, but the second GPU is never used for mps jobs.
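
To make the numbers above concrete, here is a rough sketch of the arithmetic I am assuming. It assumes the node-wide mps count (200) is split evenly across the two GPUs, and that a request is counted against a single GPU. Both are my reading of gres.html, not confirmed behavior for a gres.conf line that sets both Count and File. The function names are made up for illustration and are not part of Slurm.

```python
def mps_share_percent(requested: int, node_mps_count: int, num_gpus: int) -> float:
    """Percentage of one GPU that a --gres=mps:<requested> job would get.

    ASSUMPTION: the node-wide mps count is divided evenly across GPUs.
    """
    per_gpu = node_mps_count / num_gpus
    if requested > per_gpu:
        # ASSUMPTION: an mps request must be satisfied by a single GPU.
        raise ValueError("request exceeds the mps count of a single GPU")
    return 100.0 * requested / per_gpu


def jobs_per_gpu(requested: int, node_mps_count: int, num_gpus: int) -> int:
    """How many identical mps jobs fit on one GPU under the same assumption."""
    per_gpu = node_mps_count / num_gpus
    return int(per_gpu // requested)


if __name__ == "__main__":
    # Values from the test configuration: Gres=gpu:2,mps:200 and --gres=mps:50
    print(mps_share_percent(50, 200, 2))
    print(jobs_per_gpu(50, 200, 2))
```

Under that assumption each mps:50 job is half of one GPU and two of them fill it, which matches the two tasks I see landing on the first GPU. It does not explain why the second GPU stays idle.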

Is the NVIDIA MPS service required for the MPS gres to function in Slurm?
Is there more detailed documentation available on how MPS should be set up and how it functions?

We have a mixed workload on machines with 4 GPUs: shared-GPU jobs that use 1 CPU core and a small percentage of one GPU, and dedicated-GPU jobs that use a whole number of GPUs and CPUs. It would be nice to have them coexist on the same machines instead of splitting the machines into two separate partitions, one for each style of job.

Thanks.