[slurm-users] Requirement of one GPU job should run in GPU nodes in a cluster

Steffen Grunewald steffen.grunewald at aei.mpg.de
Fri Dec 17 07:53:37 UTC 2021


On Fri, 2021-12-17 at 13:03:32 +0530, Sudeep Narayan Banerjee wrote:
> Hello All: Can we restrict a GPU node to a single GPU job?
> 
> That is,
> a) we submit a GPU job to an empty node (say gpu2) requesting 16 cores,
> since that gives the best GPU performance.
> b) Then another user floods the remaining CPU cores on gpu2, sharing the
> node. The net result is that the GPU job takes roughly a 40% performance
> hit on its next run.
> 
> Can we make some changes in the slurm configuration such that when a GPU
> job is submitted in a GPU node, no other job can enter that GPU node?

Hi,

your scenario is incomplete :/

In your scenario, a (job_submit?) script could probably change the number
of cores requested to the maximum available, thus preventing anything else
from entering the machine afterwards.
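Something along these lines (untested, for the job_submit/lua plugin; the
core count of 16 and the "gpu" string matching are just placeholders for
your site) might be a starting point:

  -- job_submit.lua: force GPU jobs to claim a whole node's cores
  function slurm_job_submit(job_desc, part_list, submit_uid)
     -- depending on the Slurm version the GPU request may show up in
     -- job_desc.gres or job_desc.tres_per_node
     local gres = job_desc.gres or job_desc.tres_per_node or ""
     if string.find(gres, "gpu") then
        -- claim all cores of the node (16 is just this example's
        -- per-node core count) so nothing else fits on it
        job_desc.min_cpus = 16
     end
     return slurm.SUCCESS
  end

  function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
     return slurm.SUCCESS
  end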
But:

What if some CPU cores of the GPU machine are already in use? Even if that
job behaves nicely at the time the GPU job gets scheduled to the machine,
there's no guarantee that this won't change the next moment.

If your GPU machines are of identical configuration, the only feasible way
seems to be to request a full machine.
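E.g. (using the plain sbatch options, if I'm not mistaken):

  #SBATCH --gres=gpu:1
  #SBATCH --exclusive

or, on the admin side, OverSubscribe=EXCLUSIVE on the GPU partition in
slurm.conf should keep everyone else off the node - at the price of idle
cores whenever the GPU job doesn't use them all.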
This won't work that easily if your setup is inhomogeneous, and/or if there
are multiple GPUs in a single machine.

Sometimes there's no technical solution to social problems (assuming that
CPU flooding happens on purpose and knowingly, not by accident), I'm afraid...

Best,
 Steffen


