[slurm-users] How to share GPU resources? (MPS or another way?)

Tue Oct 8 06:47:37 UTC 2019

Hello guys,

I'd like to ask for tips on sharing GPU resources with Slurm. I have multiple GPUs in my cluster and multiple users who submit
their jobs as Slurm batch jobs. However, GPU usage depends on what each job is doing and is uneven, so some jobs use the GPU only
lightly (or only a small fraction of the time). In such cases, I'd like jobs to be able to share a GPU, for example by assigning
0.5 GPU to a job (where 1 means the job uses one whole GPU, as with --gres=gpu:1).

Before asking here, I tried Slurm's MPS support (https://slurm.schedmd.com/gres.html#MPS_Management), whose documentation says
"the same GPU can be allocated as MPS generic resources to multiple jobs belonging to multiple users". However, it doesn't work as
I expected at all. At first, Slurm seemed to work as designed: I added the mps configuration to Slurm and turned on the cons_tres
plugin, and then jobs requesting a smaller mps count than the one in gres.conf could be assigned to the same node together.
However, the MPS server on the node doesn't work when *multiple users* submit jobs. In that case, it looks like one user's job
waits for the GPU until the other user's job holding it finishes, just as it would with --gres=gpu:1. Moreover, the NVIDIA docs
seem to describe what I hit (https://docs.nvidia.com/deploy/mps/index.html#topic_4_3). It seems that an MPS server is created per
user and that the servers run exclusively, so I have doubts about this approach...
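
For anyone trying to reproduce this, a setup of the kind described above might look roughly like the sketch below. The node name
(node01), the device file, the counts, and the script name (job.sh) are illustrative assumptions, not my actual values, and other
node parameters are omitted; the parameter names follow the Slurm GRES documentation linked above.

```
# slurm.conf (fragment; node01 is a hypothetical node name)
SelectType=select/cons_tres
GresTypes=gpu,mps
NodeName=node01 Gres=gpu:1,mps:100

# gres.conf on node01 (fragment; assumes a single GPU at /dev/nvidia0)
Name=gpu File=/dev/nvidia0
Name=mps Count=100 File=/dev/nvidia0

# Job submission: request half of that GPU's MPS share (job.sh is hypothetical)
# sbatch --gres=mps:50 job.sh
```

With Count=100 on a single GPU, --gres=mps:50 would correspond roughly to the "0.5 GPU" share mentioned earlier.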

This is where I stand for now, but I'm not sure whether this is the expected behavior. I'd like to hear your opinions, because I
may be missing something, or there may be another way to share GPU resources besides MPS.

Has anyone hit the same issue? Could anyone help me?

Thanks,

--------------------------------------------
露崎　浩太 (Kota Tsuyuzaki)
kota.tsuyuzaki.pc at hco.ntt.co.jp
NTT Software Innovation Center
+81-422-59-2837
---------------------------------------------