[slurm-users] Limit concurrent gpu resources

Wed Apr 24 18:11:45 UTC 2019

Here's how we handle this at our site:

Create a separate partition named debug that also contains that node. 
Give the debug partition a very short time limit, say 30-60 minutes: 
long enough for debugging, but too short to do any real work. Make the 
priority of the debug partition much higher than that of the regular 
partition. With that setup, debug users may not get a GPU right away, 
but their jobs should go to the head of the queue, so as soon as a GPU 
becomes available, a debug job will get it.
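
As a rough sketch of what that could look like in slurm.conf (the node 
name gpunode01, the partition name gpu, the 7-day limit on the regular 
partition, and the tier values are placeholders I made up for 
illustration, not our actual settings):

```
# slurm.conf fragment (illustrative only; names and values are assumptions)
# Regular partition: default, long time limit, lower scheduling tier.
PartitionName=gpu   Nodes=gpunode01 Default=YES MaxTime=7-00:00:00 PriorityTier=1  State=UP
# Debug partition: same node, 1-hour limit, higher scheduling tier.
PartitionName=debug Nodes=gpunode01 Default=NO  MaxTime=01:00:00   PriorityTier=10 State=UP
```

Pending jobs in the partition with the higher PriorityTier are 
considered before those in lower tiers, which is what moves debug jobs 
to the front. Users would then submit with something like 
"sbatch -p debug --gres=gpu:1 job.sh", where job.sh stands in for their 
own script.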

--
Prentice

On 4/24/19 11:06 AM, Mike Cammilleri wrote:
> Hi everyone,
>
> We have a single node with 8 GPUs. Users often queue up many pending 
> jobs and use all 8 at the same time, so a user who just wants to run a 
> short debug job on one GPU has to wait too long for one to free up. Is 
> there a way, with gres.conf or a QOS, to limit the number of GPUs in 
> concurrent use for all users? Most submissions are single jobs that 
> request a GPU with --gres=gpu:1, but users submit many of them (not as 
> an array). Our gres.conf looks like this:
>
> Name=gpu File=/dev/nvidia0 #CPUs=0,1,2,3
> Name=gpu File=/dev/nvidia1 #CPUs=4,5,6,7
> Name=gpu File=/dev/nvidia2 #CPUs=8,9,10,11
> Name=gpu File=/dev/nvidia3 #CPUs=12,13,14,15
> Name=gpu File=/dev/nvidia4 #CPUs=16,17,18,19
> Name=gpu File=/dev/nvidia5 #CPUs=20,21,22,23
> Name=gpu File=/dev/nvidia6 #CPUs=24,25,26,27
> Name=gpu File=/dev/nvidia7 #CPUs=28,29,30,31
>
> I thought of insisting that they submit the jobs as an array and limit 
> with %7, but maybe there's a more elegant solution using the config.
>
> Any tips appreciated.
>
> Mike Cammilleri
>
> Systems Administrator
>
> Department of Statistics | UW-Madison
>
> 1300 University Ave | Room 1280
> 608-263-6673 | mikec at stat.wisc.edu
>