[slurm-users] Combining Preemption and cons_res

Anthony Ruth ARuth at HEEsolar.com
Tue Nov 12 19:05:42 UTC 2019


I recently changed our slurm.conf file to allow for job preemption. While making this change, I also chose to use select/cons_res to understand how preemption would interact with our future upgrade, which will include GPUs. The code we will run on the GPUs can only use a single CPU, so I would like the GPU nodes to take on two jobs at once: one job using the GPU and a single CPU, and another job using the remaining CPUs. For instance, if the node had 12 cores, it would run a job on the GPU and one CPU, and another job on the remaining 11 cores. With regard to partitions and job preemption, those 11 cores should behave the same as the full 16 cores on another machine without a GPU.
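For reference, a GPU node set up this way would presumably start from a GRES definition. A minimal sketch, assuming a hypothetical node name `gpu0` and an NVIDIA device path (our GPU nodes are not installed yet, so these are placeholders):

```shell
# gres.conf (sketch; node name and device path are assumptions)
NodeName=gpu0 Name=gpu File=/dev/nvidia0

# Matching slurm.conf node entry, using the 12-core example above
NodeName=gpu0 CPUs=12 Gres=gpu:1 RealMemory=2000 State=UNKNOWN
```

The single-CPU GPU job would then request `--gres=gpu:1 --cpus-per-task=1`, leaving the other 11 cores for the second job.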

As I see it, there are two logical ways to approach this. Either the node with the GPU is viewed as two nodes, each assigned a job using select/linear as the SelectType (I have not seen any documentation describing how to do this, so I have not attempted it), or the resources of the node are individually assigned by using select/cons_res. (The manual suggests using select/cons_tres for GPUs, but Slurm reports that the plugin is not found, so for now I am attempting this with cons_res.) However, select/cons_res introduces a new problem for job preemption, even on the simpler nodes which should only run one job per node.
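For context, the alternatives correspond to these SelectType settings in slurm.conf (as I understand it, select/cons_tres only exists in Slurm 19.05 and later, which may explain the plugin not being found on an older installation):

```shell
# Whole-node allocation:
SelectType=select/linear

# Per-core allocation (what I am testing now):
SelectType=select/cons_res
SelectTypeParameters=CR_Core

# GPU-aware per-core allocation, as the manual suggests (plugin not found here):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
```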

It appears that select/cons_res behaves mostly the same as select/linear except for node sharing. Job preemption mostly works as expected; however, in the partition which uses job preemption, multiple jobs are placed on the same node. This partition has OverSubscribe=FORCE:1, so Slurm is not oversubscribing the CPUs, but it is assigning fewer CPUs to each job than intended (probably 1 CPU per job). A job can be submitted with a specified number of CPUs, or with its own OverSubscribe setting (though the job's OverSubscribe setting is overruled by the partition's). I can set a specific number of CPUs for the job, but that is not what is desired: the job should use all CPUs besides those reserved for the GPU.
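To illustrate, the workaround of requesting a fixed CPU count looks like the batch-script sketch below (the script name is a placeholder, and the core count follows the 12-core example above; what I actually want is "all cores not reserved for the GPU" without hard-coding 11):

```shell
#!/bin/bash
# Hard-coded CPU request: works under cons_res, but not what is desired.
#SBATCH --partition=hi_pri
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=11   # would rather express "all CPUs except the GPU's one"

srun ./my_cpu_job   # placeholder for the actual workload
```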

Is it possible to apply select/linear to the portion of the cluster which participates in preemption and select/cons_tres to the GPU portion?
Can I use cons_res such that a job requests all CPUs besides those reserved for the GPU?
Is it possible to allocate the GPU using select/linear?

Here are the relevant portions of slurm.conf:

# Node Configurations
NodeName=DEFAULT CPUs=2 RealMemory=2000 TmpDisk=6400 State=UNKNOWN
NodeName=devel CPUs=32 NodeAddr=workmaster NodeHostname=workmaster
NodeName=pegasus[0-1] CPUs=24 CoresPerSocket=12 ThreadsPerCore=2 NodeAddr=192.168.250.[5-6] NodeHostname=pegasus[0-1]
NodeName=steve[0-4] CPUs=16 CoresPerSocket=8 ThreadsPerCore=2 NodeAddr=192.168.250.[7-11] NodeHostname=steve[0-4]

# Partition Configurations
PartitionName=DEFAULT State=UP
PartitionName=low_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=NO PreemptMode=REQUEUE PriorityTier=1 OverSubscribe=Exclusive
PartitionName=med_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=YES PreemptMode=REQUEUE PriorityTier=2 OverSubscribe=Exclusive
PartitionName=hi_pri Nodes=devel,pegasus[0-1],steve[0-4] Default=NO PreemptMode=OFF PriorityTier=3 OverSubscribe=FORCE:1