[slurm-users] ignore gpu resources when scheduling cpu-based jobs

Renfro, Michael Renfro at tntech.edu
Tue Jun 16 13:49:46 UTC 2020


Not trying to argue unnecessarily, but what you describe is not a universal rule, regardless of QOS.

Our GPU nodes are members of 3 GPU-related partitions, 2 more resource-limited non-GPU partitions, and one of two larger-memory partitions. It’s set up this way to minimize idle resources: we didn’t buy enough GPUs in those nodes to keep all their CPUs busy, and our other nodes have too few DIMM slots for larger-memory jobs.
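
For a quick view of that overlap, a node-oriented sinfo along these lines should list each GPU node once per partition it belongs to (node names taken from the config further down; adjust for your site):

=====

# One line per (node, partition) pair, with CPU and memory counts:
sinfo --Node --nodes='gpunode[001-004]' --format='%N %P %c %m'

=====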

We have a job submit plugin that automatically routes jobs to ‘interactive’, ‘gpu-interactive’, or ‘any-interactive’ depending on the resources requested. First terminal: this results in a job running in the ‘any-interactive’ partition on gpunode002:

=====

[renfro@login rosetta-job]$ type hpcshell
hpcshell is a function
hpcshell ()
{
    srun --partition=interactive $@ --pty bash -i
}
[renfro@login rosetta-job]$ hpcshell
[renfro@gpunode002(job 751070) rosetta-job]$

=====

Second terminal, running at the same time as the first: this results in a job in the ‘gpu-interactive’ partition, also on gpunode002:

=====

[renfro@login ~]$ hpcshell --gres=gpu
[renfro@gpunode002(job 751071) ~]$ squeue -t R -u $USER
JOBID  PARTI       NAME       USER ST         TIME S:C: NODES MIN_MEMORY NODELIST(REASON)         SUBMIT_TIME          START_TIME            END_TIME TRES_PER_NODE
751071 gpu-i       bash     renfro  R         0:08 *:*: 1     2000M      gpunode002       2020-06-16T08:27:50 2020-06-16T08:27:50 2020-06-16T10:27:50 gpu
751070 any-i       bash     renfro  R         0:18 *:*: 1     2000M      gpunode002       2020-06-16T08:27:40 2020-06-16T08:27:40 2020-06-16T10:27:41 N/A
[renfro@gpunode002(job 751071) ~]$

=====
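
To confirm from outside the jobs that the two partitions really are sharing the node, something like the following should show both jobs’ CPUs charged against gpunode002 at the same time:

=====

# Allocated vs. total CPUs on the shared node, and the partitions it serves:
scontrol show node gpunode002 | grep -E 'CPUAlloc|CPUTot|Partitions'

# Running jobs on that node, with their partition, owner, and CPU count:
squeue --states=R --nodelist=gpunode002 --format='%i %P %u %C'

=====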

Selected configuration details (excluding things like resource ranges and defaults):

NodeName=gpunode[001-003]  CoresPerSocket=14 RealMemory=382000 Sockets=2 ThreadsPerCore=1 Weight=10011 Gres=gpu:2
NodeName=gpunode004  CoresPerSocket=14 RealMemory=894000 Sockets=2 ThreadsPerCore=1 Weight=10021 Gres=gpu:2

PartitionName=gpu Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-debug Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=gpu-interactive Default=NO MaxCPUsPerNode=16 ExclusiveUser=NO State=UP Nodes=gpunode[001-004]
PartitionName=any-interactive Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=any-debug Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=node[001-040],gpunode[001-004]
PartitionName=bigmem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode[001-003]
PartitionName=hugemem Default=NO MaxCPUsPerNode=12 ExclusiveUser=NO State=UP Nodes=gpunode004
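
Note the arithmetic behind the MaxCPUsPerNode values: each gpunode has 2 sockets × 14 cores = 28 cores, and 16 (GPU partitions) + 12 (non-GPU partitions) = 28, so neither class of job can push the other off a node entirely. A quick way to sanity-check those caps against a node’s core count is something like:

=====

# Total cores on a GPU node (28 here):
scontrol show node gpunode001 | grep -o 'CPUTot=[0-9]*'

# Per-node CPU caps for a GPU and a non-GPU partition (16 and 12 here):
scontrol show partition gpu-interactive | grep -o 'MaxCPUsPerNode=[0-9]*'
scontrol show partition any-interactive | grep -o 'MaxCPUsPerNode=[0-9]*'

=====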

> On Jun 16, 2020, at 8:14 AM, Diego Zuccato <diego.zuccato at unibo.it> wrote:
> 
> On 16/06/20 09:39, Loris Bennett wrote:
> 
>>> Maybe it's already known and obvious, but... Remember that a node can be
>>> allocated to only one partition.
>> Maybe I am misunderstanding you, but I think that this is not the case.
>> A node can be in multiple partitions.
> *Assigned* to multiple partitions: OK.
> But once Slurm schedules a job in "partGPU" on that node, the whole node
> is unavailable for jobs in "partCPU", even if the GPU job is using only
> 1% of the resources.
> 
>> We have nodes belonging to
>> individual research groups which are in both a separate partition just
>> for the group and in a 'scavenger' partition for everyone (but with
>> lower priority and maximum run-time).
> More or less our current config. Quite inefficient, at least for us: too
> many unusable resources due to small jobs.
> 
>>> So, if you have the mixed nodes in both
>>> partitions and there's a GPU job running, a non-gpu job will find that
>>> node marked as busy because it's allocated to another partition.
>>> That's why we're drastically reducing the number of partitions we have
>>> and will avoid shared nodes.
>> Again, I don't think this is the explanation.  If a job is running on a GPU node,
>> but not using all the CPUs, then a CPU-only job should be able to start
>> on that node, unless some form of exclusivity has been set up, such as
>> ExclusiveUser=YES for the partition.
> Nope. The whole node gets allocated to one partition at a time. So if
> the GPU job and the CPU one are in different partitions, it's expected
> that only one starts. The behaviour you're looking for is that of
> QoS: define a single partition with multiple QoSes and both jobs will run
> concurrently.
> 
> If you think about it, that's the meaning of "partition" :)
> 
> --
> Diego Zuccato
> DIFA - Dip. di Fisica e Astronomia
> Servizi Informatici
> Alma Mater Studiorum - Università di Bologna
> V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
> tel.: +39 051 20 95786
> 


