[slurm-users] Revisit: Split a GPU cluster into GPU cores and shared cores
Barry Moore
moore0557 at gmail.com
Thu Apr 19 09:31:50 MDT 2018
Chris,
> We do have the issue where the four free cores are on one socket,
> rather than being equally distributed across the sockets. When I
> solicited advice from SchedMD for our config it seems they are
> doing some work in this area that may hopefully surface in the next
> major release (though likely only as a "beta" proof of concept).
It sounds like I am forced to wait, unfortunately. Thanks a lot for this response; I
will keep an eye on that bug report.
Thanks,
Barry
On Thu, Apr 19, 2018 at 09:58:16AM +1000, Christopher Samuel wrote:
> On 19/04/18 07:11, Barry Moore wrote:
>
> > My situation is similar. I have a GPU cluster with gres.conf entries
> > which look like:
> >
> > NodeName=gpu-XX Name=gpu File=/dev/nvidia[0-1] CPUs=[0-5]
> > NodeName=gpu-XX Name=gpu File=/dev/nvidia[2-3] CPUs=[6-11]
> >
> > However, as you can imagine 8 cores sit idle on these machines for no
> > reason. Is there a way to easily set this up?
>
> We do this with overlapping partitions:
>
> PartitionName=skylake Default=YES State=DOWN [...] MaxCPUsPerNode=32
> PartitionName=skylake-gpu Default=NO State=DOWN [...] Priority=1000
>
> Our submit filter then forces jobs that request gres=gpu into the
> skylake-gpu partition and those that don't into the skylake partition.
>
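The filter itself isn't shown here, but Slurm's job_submit/lua plugin is the usual
place for that kind of routing. A minimal sketch, assuming the partition names above
and that job_desc.gres carries the --gres request (newer Slurm releases expose it as
tres_per_node instead); this is only an illustration, not the actual site filter:

  -- job_submit.lua: send jobs that request GPUs to the GPU partition,
  -- everything else to the CPU-only partition (sketch only)
  function slurm_job_submit(job_desc, part_list, submit_uid)
      if job_desc.gres ~= nil and string.find(job_desc.gres, "gpu") then
          job_desc.partition = "skylake-gpu"
      else
          job_desc.partition = "skylake"
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end
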
> Our gres.conf has:
>
> NodeName=[...] Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
> NodeName=[...] Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35
>
> But of course the Cores= spec is just advisory to the scheduler;
> the user can make it a hard requirement by specifying:
>
> --gres-flags=enforce-binding
>
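As a usage example (with the p100 gres type from the gres.conf above, and "job.sh" as
a placeholder batch script), something like
"sbatch --gres=gpu:p100:1 --gres-flags=enforce-binding job.sh" will only be allocated
CPUs from the cores listed for the chosen GPU, rather than treating Cores= as a hint.
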
> We do have the issue where the four free cores are on one socket,
> rather than being equally distributed across the sockets. When I
> solicited advice from SchedMD for our config it seems they are
> doing some work in this area that may hopefully surface in the next
> major release (though likely only as a "beta" proof of concept).
>
> https://bugs.schedmd.com/show_bug.cgi?id=4717
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
>
--
Barry E Moore II, PhD
E-mail: bmooreii at pitt.edu
Assistant Research Professor
Center for Research Computing
University of Pittsburgh
Pittsburgh, PA 15260