Hello,

I recently started setting AutoDetect=nvml in our GPU nodes' gres.conf files to automatically populate Cores and Links for the GPUs, and it has been working well.
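For reference, the per-node gres.conf is essentially just the AutoDetect line; the slurm.conf entries below are only illustrative (node names and GPU counts are placeholders):

    # gres.conf on each GPU node; Cores and Links are populated by NVML when slurmd starts
    AutoDetect=nvml

    # slurm.conf (illustrative)
    GresTypes=gpu
    NodeName=gpu[001-016] Gres=gpu:4 Sockets=2 ...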

I'm now wondering whether I can prioritize scheduling single-GPU jobs onto NVLink pairs (these are PCIe A6000s) where one GPU in the pair is already running a single-GPU job, assuming the socket with affinity to that pair has enough free cores for the job. Some users want to run single-GPU jobs and others dual-GPU jobs on the same nodes, so for better job throughput we would prefer not to configure each NVLink pair as a single GRES.
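To make the two job shapes concrete, this is roughly what users submit (script names and CPU counts are just examples):

    # single-GPU job
    sbatch --gres=gpu:1 --cpus-per-task=8 single_gpu_job.sh

    # dual-GPU job that we'd like to land on an NVLink pair
    sbatch --gres=gpu:2 --cpus-per-task=16 dual_gpu_job.sh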

As things stand, I've observed that on a node with at least 4 GPUs and 2 sockets (one NVLink pair per socket), Slurm prioritizes evening out core allocation between the sockets. Once the second single-GPU job is submitted, one GPU in each NVLink pair is occupied, and a subsequent dual-GPU job can still run but doesn't get an NVLink pair.
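Concretely, on a 4-GPU node where GPUs 0+1 are the pair on socket 0 and GPUs 2+3 are the pair on socket 1, the sequence looks roughly like this (GPU indices are illustrative):

    sbatch --gres=gpu:1 job_a.sh   # gets GPU 0 (socket 0)
    sbatch --gres=gpu:1 job_b.sh   # gets GPU 2 (socket 1), evening out core usage across sockets
    sbatch --gres=gpu:2 job_c.sh   # gets GPUs 1 and 3, which are not an NVLink pair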

We also have a few nodes where individual GPUs have failed, leaving some NVLink'd pairs and usually a single non-NVLink'd GPU (3 or 7 GPUs in total). In that case, it would be ideal if single-GPU jobs were also preferentially scheduled onto the non-NVLink'd GPU.
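Written out explicitly, the topology NVML detects on a 3-GPU node of that kind would look something like the gres.conf below (device paths, core ranges, and link counts are placeholders; the real values come from NVML):

    # GPUs 0 and 1 are an NVLink pair, GPU 2 is not NVLink'd
    Name=gpu File=/dev/nvidia0 Cores=0-31  Links=-1,4,0
    Name=gpu File=/dev/nvidia1 Cores=0-31  Links=4,-1,0
    Name=gpu File=/dev/nvidia2 Cores=32-63 Links=0,0,-1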

Is this possible?

All the best,
Matthew

--
Matthew Baney
Assistant Director of Computational Systems
mbaney@umd.edu | (301) 405-6756
University of Maryland Institute for Advanced Computer Studies
3154 Brendan Iribe Center
8125 Paint Branch Dr.
College Park, MD 20742