[slurm-users] Revisit: Split a GPU cluster into GPU cores and shared cores
Barry Moore
moore0557 at gmail.com
Thu Apr 19 09:31:50 MDT 2018
Chris,
> We do have the issue where the four free cores are on one socket,
> rather than being equally distributed across the sockets. When I
> solicited advice from SchedMD for our config it seems they are
> doing some work in this area that may hopefully surface in the next
> major release (though likely only as a "beta" proof of concept).
It sounds like I am forced to wait, unfortunately. Thanks a lot for this response; I
will keep an eye on that bug report.
Thanks,
Barry
On Thu, Apr 19, 2018 at 09:58:16AM +1000, Christopher Samuel wrote:
> On 19/04/18 07:11, Barry Moore wrote:
>
> > My situation is similar. I have a GPU cluster with gres.conf entries
> > which look like:
> >
> > NodeName=gpu-XX Name=gpu File=/dev/nvidia[0-1] CPUs=[0-5]
> > NodeName=gpu-XX Name=gpu File=/dev/nvidia[2-3] CPUs=[6-11]
> >
> > However, as you can imagine 8 cores sit idle on these machines for no
> > reason. Is there a way to easily set this up?
>
> We do this with overlapping partitions:
>
> PartitionName=skylake Default=YES State=DOWN [...] MaxCPUsPerNode=32
> PartitionName=skylake-gpu Default=NO State=DOWN [...] Priority=1000
>
> Our submit filter then forces jobs that request gres=gpu into the
> skylake-gpu partition and those that don't into the skylake partition.
>
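The filter itself isn't shown here, but Slurm's job_submit/lua plugin is the usual
place for that kind of routing. A minimal sketch, assuming the partition names above
and that job_desc.gres carries the --gres request (newer Slurm releases expose it as
tres_per_node instead); this is only an illustration, not the actual site filter:

  -- job_submit.lua: send jobs that request GPUs to the GPU partition,
  -- everything else to the CPU-only partition (sketch only)
  function slurm_job_submit(job_desc, part_list, submit_uid)
      if job_desc.gres ~= nil and string.find(job_desc.gres, "gpu") then
          job_desc.partition = "skylake-gpu"
      else
          job_desc.partition = "skylake"
      end
      return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
      return slurm.SUCCESS
  end
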
> Our gres.conf has:
>
> NodeName=[...] Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
> NodeName=[...] Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35
>
> But of course the Cores= spec is just advisory to the scheduler;
> the user can make it a hard requirement by specifying:
>
> --gres-flags=enforce-binding
>
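As a usage example (with the p100 gres type from the gres.conf above, and "job.sh" as
a placeholder batch script), something like
"sbatch --gres=gpu:p100:1 --gres-flags=enforce-binding job.sh" will only be allocated
CPUs from the cores listed for the chosen GPU, rather than treating Cores= as a hint.
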
> We do have the issue where the four free cores are on one socket,
> rather than being equally distributed across the sockets. When I
> solicited advice from SchedMD for our config it seems they are
> doing some work in this area that may hopefully surface in the next
> major release (though likely only as a "beta" proof of concept).
>
> https://bugs.schedmd.com/show_bug.cgi?id=4717
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
>
--
Barry E Moore II, PhD
E-mail: bmooreii at pitt.edu
Assistant Research Professor
Center for Research Computing
University of Pittsburgh
Pittsburgh, PA 15260