[slurm-users] Requirement of one GPU job should run in GPU nodes in a cluster

Fri Dec 17 08:28:46 UTC 2021

Hi.

Isn't that exactly what cgroups are for?
If you use cgroups and request 1 core on a machine with N available, you 
will only use the one you requested, even if the others are idle. If 
another job gets scheduled on the same machine, it's because the 
resources it requested are available.
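
For reference, a minimal sketch of the settings that usually turn this
confinement on. The parameter names are standard Slurm options, but the
values are assumptions about a typical setup, not taken from the
original poster's cluster, so check them against your own slurm.conf,
cgroup.conf and gres.conf:

```
# slurm.conf (fragment; values are assumptions, adapt to your site)
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# cgroup.conf (fragment)
ConstrainCores=yes       # a job only sees the cores it was allocated
ConstrainRAMSpace=yes    # a job cannot exceed the memory it requested
ConstrainDevices=yes     # a job only sees the GPUs it requested via --gres
```

ConstrainDevices only has an effect if the GPUs are declared as GRES
with their device files in gres.conf. With it enabled, a CPU-only job
that lands on the GPU node cannot touch the GPU at all.
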
In my (limited) experience, the usual problem is the RAM request: 
while users tend to estimate the number of cores they need quite 
accurately, they greatly overestimate the memory, often by 3 orders of 
magnitude. For this, the 'seff' tool is quite educational, to the point 
that its output could be useful in the job completion mail :)
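
To make the "3 orders of magnitude" point concrete, here is a small
Python sketch of the kind of arithmetic behind a memory-efficiency
figure. It is only an illustration: the function names and the 20%
headroom are my own assumptions, and it does not reproduce how seff
itself is implemented.

```python
import math


def memory_efficiency(requested_mb: float, peak_used_mb: float) -> float:
    """Return peak memory use as a percentage of the requested memory."""
    if requested_mb <= 0:
        raise ValueError("requested_mb must be positive")
    return 100.0 * peak_used_mb / requested_mb


def suggested_request_mb(peak_used_mb: float, headroom: float = 0.2) -> int:
    """Return peak use plus a safety margin, rounded up to a whole MB.

    The default 20% headroom is an arbitrary assumption, not a Slurm rule.
    """
    return math.ceil(peak_used_mb * (1.0 + headroom))


if __name__ == "__main__":
    # Hypothetical job: 64000 MB requested, 64 MB actually used at peak,
    # i.e. an overestimate of three orders of magnitude.
    requested, used = 64000, 64
    print(f"memory efficiency: {memory_efficiency(requested, used):.2f}%")
    print(f"suggested request: {suggested_request_mb(used)} MB")
```

The numbers to feed in are the requested memory and the peak usage of a
finished job, which is the same information seff summarises.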

On 17/12/2021 08:53, Steffen Grunewald wrote:
> On Fri, 2021-12-17 at 13:03:32 +0530, Sudeep Narayan Banerjee wrote:
>> Hello All: Can we please restrict one GPU job on one GPU node?
>>
>> That is,
>> a) We submit a GPU job on an empty node (say gpu2) requesting 16 cores,
>> as that gives the best performance on the GPU.
>> b) Then another user floods the CPU cores on gpu2, sharing the GPU
>> resources. The net result is that the GPU job takes a 40% performance
>> hit in the next run.
>>
>> Can we make some changes in the slurm configuration such that when a GPU
>> job is submitted in a GPU node, no other job can enter that GPU node?
> 
> Hi,
> 
> your scenario is incomplete :/
> 
> In your scenario, a (job_submit?) script could probably change the number
> of cores requested to the maximum available, thus avoiding anything else
> entering the machine afterwards.
> But:
> 
> What if some CPU cores of the GPU machine are already in use? Even if that
> job behaves nicely at the time the GPU job gets scheduled to the machine,
> this doesn't guarantee that this won't change the next moment.
> 
> If your GPU machines are of identical configuration, the only feasible way
> seems to be to request a full machine.
> This won't work that easily if your setup is inhomogeneous, and/or if there
> are multiple GPUs in a single machine.
> 
> Sometimes there's no technical solution to social problems (assuming that
> CPU flooding happens on purpose and knowingly, not by accident), I'm afraid...
> 
> Best,
>   Steffen
> 

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786