[slurm-users] select/cons_res - found bug when allocating job with --cpus-per-task (-c) option on slurm 17.11.9 (fix included).

Didier GAZEN didier.gazen at aero.obs-mip.fr
Wed Sep 5 08:33:13 MDT 2018


Hi,

On a cluster of quad-processor nodes, I encountered the following issue
during a job allocation for an application with 4 tasks, each requiring
3 processors (this is exactly the example given in the --cpus-per-task
section of the salloc manpage).

Here are the facts:

First, the cluster configuration: 4 nodes, 1 socket/node, 4 cores/socket,
1 thread/core

 > sinfo -V
slurm 17.11.9-2

 > sinfo
PARTITION        AVAIL  TIMELIMIT NODES   CPUS(A/I/O/T) STATE  NODELIST
any*             up       2:00:00     4       0/16/0/16 idle~  n[101-104]

1) With the CR_CORE consumable resource:

 > scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

 > salloc -n4 -c3
salloc: Granted job allocation 218
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  218   normal       43          any bash     sila   R       0:03      3    10  n[101-103]
 > srun hostname
srun: error: Unable to create step for job 218: More processors requested than permitted

We can see that the number of granted processors and nodes is completely
wrong: 10 CPUs instead of 12 and only 3 nodes instead of 4.
The correct behaviour when requesting 4 tasks (-n 4) with 3 processors
per task (-c 3) on a cluster of quad-core nodes should be for the
controller to grant an allocation of 4 nodes, one for each of the 4 tasks.
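
For illustration, here is a minimal sketch (my own, not Slurm code; the
variable names are mine) of the arithmetic I would expect the controller
to apply in this case:

#include <stdio.h>

int main(void)
{
    /* request: salloc -n4 -c3 on nodes with 4 cores, 1 thread/core */
    int ntasks         = 4;
    int cpus_per_task  = 3;
    int cores_per_node = 4;

    /* how many 3-cpu tasks fit on a 4-core node: 4 / 3 = 1 */
    int tasks_per_node = cores_per_node / cpus_per_task;

    /* nodes needed: ceil(4 / 1) = 4 */
    int nodes = (ntasks + tasks_per_node - 1) / tasks_per_node;

    /* cpus actually used by the job: 4 * 3 = 12 */
    int cpus = ntasks * cpus_per_task;

    printf("nodes=%d cpus=%d\n", nodes, cpus);  /* nodes=4 cpus=12 */
    return 0;
}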

Note that when specifying --tasks-per-node=1, the behaviour is correct:

 > salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 221
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  221   normal       43          any bash     sila   R       0:03      4    12  n[101-104]
 > srun hostname
n101
n103
n102
n104

2) With the CR_SOCKET consumable resource:

 > scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET

 > salloc -n4 -c3
salloc: Granted job allocation 226
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  226   normal       43          any bash     sila   R       0:02      3    12  n[101-103]

Here, Slurm allocates the right number of processors (12) but the number
of allocated nodes is wrong: 3 instead of 4. As a result, two tasks end
up on the same node (n101):

 > srun hostname
n102
n101
n101
n103

Again, when specifying --tasks-per-node=1, the behaviour is correct:

 > salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 230
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  230   normal       43          any bash     sila   R        0:03      4    16  n[101-104]

Note that 16 processors have been allocated instead of 12, but this is
correct because Slurm is configured with the CR_Socket consumable
resource (each node has 1 socket with 4 cores/socket). The srun output
is as expected:

sila@master2-l422:~> srun hostname
n101
n102
n103
n104
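
As an aside, the 16-CPU figure simply reflects the socket granularity of
CR_Socket; a rough sketch of that rounding (my own illustration, not the
plugin code):

#include <stdio.h>

int main(void)
{
    /* 1 socket of 4 cores per node, one 3-cpu task per node */
    int ntasks           = 4;
    int cpus_per_task    = 3;
    int cores_per_socket = 4;

    /* CR_Socket hands out whole sockets, so a 3-cpu task still
     * consumes a full 4-core socket on its node */
    int sockets_per_task = (cpus_per_task + cores_per_socket - 1)
                           / cores_per_socket;                      /* 1 */
    int allocated_cpus   = ntasks * sockets_per_task * cores_per_socket;

    printf("allocated cpus = %d\n", allocated_cpus);  /* 16 */
    return 0;
}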

3) Conclusion and fix:

I figured that --tasks-per-node should not be mandatory to obtain the
correct behaviour, so I did some investigating.

I think a bug was introduced when unifying the allocation code for
CR_Socket and CR_Core (commit 6fa3d5ad), in Step 3 of the _allocate_sc()
function (src/plugins/select/cons_res/job_test.c), when computing
avail_cpus:

src/plugins/select/cons_res/job_test.c, _allocate_sc(...):

         if (cpus_per_task < 2) {
                 avail_cpus = num_tasks;
         } else if ((ntasks_per_core == 1) &&
                    (cpus_per_task > threads_per_core)) {
                 /* find out how many cores a task will use */
                 int task_cores = (cpus_per_task + threads_per_core - 1) /
                                  threads_per_core;
                 int task_cpus  = task_cores * threads_per_core;
                 /* find out how many tasks can fit on a node */
                 int tasks = avail_cpus / task_cpus;
                 /* how many cpus the the job would use on the node */
                 avail_cpus = tasks * task_cpus;
                 /* subtract out the extra cpus. */
                 avail_cpus -= (tasks * (task_cpus - cpus_per_task));
         } else {
                 j = avail_cpus / cpus_per_task;
                 num_tasks = MIN(num_tasks, j);
                 if (job_ptr->details->ntasks_per_node) <- problem
                         avail_cpus = num_tasks * cpus_per_task;
         }


The 'if (job_ptr->details->ntasks_per_node)' condition marked above as
'problem' prevents avail_cpus from being computed correctly when
--ntasks-per-node is NOT specified (and cpus_per_task > 1). Before the
_allocate_sockets and _allocate_cores functions were unified into
_allocate_sc(), this condition was only present in the
_allocate_sockets() code. It appears to be unnecessary in _allocate_sc(),
and avail_cpus should be computed unconditionally.
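
To make the effect concrete, here is a small standalone walkthrough of
that else branch with the numbers from the example above (4 available
cores on a node, -c 3, 1 thread/core). The function below is my own
reduction of the logic, not the actual plugin code:

#include <stdio.h>

/* Reduced model of the else branch in Step 3 of _allocate_sc()
 * (CR_Core, 1 thread/core, cpus_per_task > 1): returns the number
 * of cpus the node would contribute to the job. */
static int step3_avail_cpus(int avail_cpus, int num_tasks,
                            int cpus_per_task, int ntasks_per_node_set)
{
        int j = avail_cpus / cpus_per_task;   /* tasks that fit: 4/3 = 1 */
        if (j < num_tasks)
                num_tasks = j;
        if (ntasks_per_node_set)              /* the condition in question */
                avail_cpus = num_tasks * cpus_per_task;
        return avail_cpus;
}

int main(void)
{
        /* Without --ntasks-per-node the condition is false, so avail_cpus
         * stays 4 even though only one 3-cpu task fits on the node; each
         * node's contribution is overstated, which matches the too-small
         * allocation observed above (3 nodes instead of 4). */
        printf("condition skipped: %d cpus\n", step3_avail_cpus(4, 4, 3, 0));

        /* With --ntasks-per-node set (or, with the patch below, always)
         * the node correctly reports 3 usable cpus, so 4 nodes are
         * needed for the 4 tasks. */
        printf("condition taken  : %d cpus\n", step3_avail_cpus(4, 4, 3, 1));
        return 0;
}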

With the following patch, the Slurm controller with select/cons_res
(CR_CORE and CR_SOCKET) does its job correctly when allocating with
--cpus-per-task (-c), without the need to specify --ntasks-per-node:

diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 25e0b8875b..4e704e8b65 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -474,8 +474,7 @@ static uint16_t _allocate_sc(struct job_record *job_ptr, bitstr_t *core_map,
         } else {
                 j = avail_cpus / cpus_per_task;
                 num_tasks = MIN(num_tasks, j);
-               if (job_ptr->details->ntasks_per_node)
-                       avail_cpus = num_tasks * cpus_per_task;
+               avail_cpus = num_tasks * cpus_per_task;
         }

         if ((job_ptr->details->ntasks_per_node &&

Test results OK after applying the patch:
1) CR_CORE:
 > scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE
 > salloc -n4 -c3
salloc: Granted job allocation 234
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  234   normal       43          any bash     sila   R       0:02      4    12  n[101-104]

2) CR_SOCKET:
 > scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET
 > salloc -n4 -c3
salloc: Granted job allocation 233
 > squeue
                JOBID      QOS PRIORITY    PARTITION NAME     USER  ST       TIME  NODES  CPUS  NODELIST(REASON)
                  233   normal       43          any bash     sila   R       0:03      4    16  n[101-104]

What do you think?

Best regards,

Didier
