One of our researchers asked whether it was possible to require a job to use NVLink-ed pairs of GPUs.
I see that there is a support ticket on the SchedMD site which covers this (https://support.schedmd.com/show_bug.cgi?id=15995). That ticket is a few years old though. Does anyone happen to know whether support for this has been added in newer releases of SLURM?
The cluster in question does use "AutoDetect=nvml" in its gres.conf and the output of "slurmd -G" shows that SLURM is aware of the NVLink pairs. I assume the scheduler is trying to use that information. What I want to know is whether there is some way for an end-user to add a constraint (for example) to a job such that it only runs on an NVLink-ed pair of GPUs.
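To make the question concrete, the only user-side approach I can think of is a check after the job has started, which is not the same as a scheduling constraint. Here is a minimal Python sketch of that idea. It assumes the job can read the matrix printed by "nvidia-smi topo -m" and that NVLink entries in that matrix look like "NV1", "NV2", and so on. The function name nvlink_pairs and the sample matrix are made up for illustration and are not output from our cluster.

```python
import re

GPU_RE = re.compile(r"^GPU(\d+)$")   # header/row labels such as GPU0
NVLINK_RE = re.compile(r"^NV\d+$")   # matrix entries such as NV1, NV2

# Hypothetical topology matrix in the style of `nvidia-smi topo -m`.
SAMPLE_TOPO = """\
        GPU0    GPU1    GPU2    GPU3    CPU Affinity
GPU0     X      NV2     SYS     SYS     0-15
GPU1    NV2      X      SYS     SYS     0-15
GPU2    SYS     SYS      X      NV2     16-31
GPU3    SYS     SYS     NV2      X      16-31
"""


def nvlink_pairs(topo_text):
    """Return sorted (i, j) GPU index pairs, i < j, joined by NVLink."""
    lines = [ln.split() for ln in topo_text.splitlines() if ln.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    # GPU indices of the matrix columns, in header order.
    cols = [int(m.group(1)) for m in map(GPU_RE.match, header) if m]
    pairs = set()
    for tokens in rows:
        m = GPU_RE.match(tokens[0])
        if not m:
            continue  # skip legend lines, NIC rows, and similar
        i = int(m.group(1))
        # The first len(cols) entries after the row label are the GPU columns.
        for j, entry in zip(cols, tokens[1:1 + len(cols)]):
            if i != j and NVLINK_RE.match(entry):
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    print(nvlink_pairs(SAMPLE_TOPO))
```

In a real job the text would presumably come from running "nvidia-smi topo -m" inside the allocation, and the job could exit or requeue itself if its two GPUs are not in the returned list. That wastes a scheduling cycle, which is why I am hoping for something the scheduler itself enforces.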
I do know that there are other ways to implement this, such as requiring jobs to use an even number of GPUs, perhaps only on some nodes so that single-GPU jobs can still run on the remaining nodes. I'm specifically asking about a flag or setting a user could apply to their own jobs. If such a thing exists, I'd love to hear about it. Thanks!