[slurm-users] How to allocate resource for jobs without causing GPU fragmentation?

Fri Jun 26 15:46:29 UTC 2020

Hi,

I'm running a GPU cluster, and I would like to know if there is a way to
allocate resources for jobs without causing GPU fragmentation.

Currently, I'm using

> SelectType=select/cons_res
>
> SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK,CR_ONE_TASK_PER_CORE

and oversubscription of CPU cores is enabled.

Let's say there are two nodes, A and B, each with 4 GPUs and 40 CPU cores.
The problem is that if jobs 1 and 2 each request 1 GPU and 30 CPU cores,
Slurm places them on different nodes (one on A, one on B). That leaves
only 3 free GPUs on each node, which prevents a future job requiring
4 GPUs from running on either of them.
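
To make the scenario concrete, here is a toy Python sketch of the two
placements. It is not Slurm's selection algorithm: the Node class and the
spread/pack policies are hypothetical names made up for illustration, and
oversubscription is modelled simply as "the core count is not a hard limit".

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_gpus: int = 4
    free_cores: int = 40


def fits(node, gpus, cores, oversubscribe_cores):
    # GPUs are never shared; cores are a hard limit only without oversubscription.
    return node.free_gpus >= gpus and (oversubscribe_cores or node.free_cores >= cores)


def place(jobs, policy, oversubscribe_cores=True):
    """Place (gpus, cores) jobs on nodes A and B; return free GPUs per node."""
    nodes = [Node("A"), Node("B")]
    for gpus, cores in jobs:
        candidates = [n for n in nodes if fits(n, gpus, cores, oversubscribe_cores)]
        if not candidates:
            continue  # the job would stay pending
        node = policy(candidates)
        node.free_gpus -= gpus
        node.free_cores -= cores  # may go negative when cores are oversubscribed
    return {n.name: n.free_gpus for n in nodes}


def spread(candidates):
    # Assumed stand-in for the observed behaviour: prefer the node with the
    # most free cores (the first node wins ties).
    return max(candidates, key=lambda n: n.free_cores)


def pack(candidates):
    # Desired behaviour: prefer the node that already has the fewest free GPUs.
    return min(candidates, key=lambda n: n.free_gpus)


jobs = [(1, 30), (1, 30)]   # jobs 1 and 2: 1 GPU and 30 cores each
print(place(jobs, spread))  # {'A': 3, 'B': 3} -> no node can take a 4-GPU job
print(place(jobs, pack))    # {'A': 2, 'B': 4} -> node B can still take a 4-GPU job
```

The second result is the kind of placement I am hoping to get.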

If I'm not mistaken, a simple workaround might be to stop managing CPU
cores via Slurm (e.g. CR_Memory), but that comes with downsides.

Could someone please suggest any select plugins or parameters that can
prevent such GPU fragmentation?

Best,
Jaekyeom