<p><font face="Adobe Helvetica">Hi,<br>
<br>
On a cluster comprised of quad-processors nodes, I encountered
the following issue during a job allocation for an application
that has 4 tasks, each requiring 3 processors (this is exactly
the example you are providing in the --cpus-per-task section of
the salloc manpage).</font></p>
<p><font face="Adobe Helvetica">Here are the facts:<br>
<br>
First, the cluster configuration : 4 nodes, 1 socket/node, 4
cores/socket, 1 thread/core<br>
<br>
</font><tt>> sinfo -V</tt><tt><br>
</tt><tt>slurm 17.11.9-2</tt><tt><br>
</tt><tt><br>
</tt><tt>> sinfo</tt><tt><br>
</tt><tt>PARTITION AVAIL TIMELIMIT NODES CPUS(A/I/O/T)
STATE NODELIST</tt><tt><br>
</tt><tt>any* up 2:00:00 4 0/16/0/16
idle~ n[101-104]</tt><font face="Adobe Helvetica"><br>
<br>
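
For completeness, the node and partition definitions behind this presumably
look something like the lines below (a reconstruction from the numbers above,
not copied from the actual slurm.conf):

# hypothetical reconstruction - adjust to the real cluster
NodeName=n[101-104] Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=any Nodes=n[101-104] Default=YES MaxTime=02:00:00 State=UP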

1) With the CR_CORE consumable resource:

> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE

> salloc -n4 -c3
salloc: Granted job allocation 218
> squeue
 JOBID     QOS PRIORITY PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
   218  normal       43       any bash sila  R 0:03     3   10 n[101-103]
> srun hostname
srun: error: Unable to create step for job 218: More processors requested than permitted

We can see that the number of granted processors and nodes is wrong: 10 CPUs
instead of 12, and only 3 nodes instead of 4. The correct behaviour when
requesting 4 tasks (-n 4) with 3 processors per task (-c 3) on a cluster of
quad-core nodes should be for the controller to grant an allocation of 4
nodes, one for each of the 4 tasks.
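
To make the expected arithmetic explicit, here is a minimal standalone sketch
(just the placement computed by hand, not Slurm code) for this request:

#include <stdio.h>

int main(void)
{
	int cpus_per_node = 4;   /* 1 socket x 4 cores x 1 thread */
	int ntasks = 4;          /* salloc -n4 */
	int cpus_per_task = 3;   /* salloc -c3 */

	/* a 3-CPU task cannot span nodes, so only one task fits per node */
	int tasks_per_node = cpus_per_node / cpus_per_task;           /* = 1  */
	int nodes = (ntasks + tasks_per_node - 1) / tasks_per_node;   /* = 4  */
	int cpus  = ntasks * cpus_per_task;                           /* = 12 */

	printf("expected: %d nodes, %d CPUs\n", nodes, cpus);
	return 0;
}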

Note that when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 221
> squeue
 JOBID     QOS PRIORITY PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
   221  normal       43       any bash sila  R 0:03     4   12 n[101-104]
> srun hostname
n101
n103
n102
n104
<p><font face="Adobe Helvetica">2) With the CR_SOCKET consumable
resource:<br>
<br>
</font><tt>> scontrol show conf|grep -i select</tt><tt><br>
</tt><tt>SelectType = select/cons_res</tt><tt><br>
</tt><tt>SelectTypeParameters = CR_SOCKET</tt><tt><br>
</tt><tt><br>
</tt><tt>> salloc -n4 -c3</tt><tt><br>
</tt><tt>salloc: Granted job allocation 226</tt><tt><br>
</tt><tt>> squeue</tt><tt><br>
</tt><tt> JOBID QOS PRIORITY PARTITION
NAME USER ST TIME NODES CPUS NODELIST(REASON)</tt><tt><br>
</tt><tt> 226 normal 43 any
bash sila R 0:02 3 12 n[101-103]</tt><font
face="Adobe Helvetica"><br>

Here, Slurm allocates the right number of processors (12), but the number of
allocated nodes is wrong: 3 instead of 4, so two tasks end up on the same
node (n101):

> srun hostname
n102
n101
n101
n103

Again, when specifying --tasks-per-node=1, the behaviour is correct:

> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 230
> squeue
 JOBID     QOS PRIORITY PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
   230  normal       43       any bash sila  R 0:03     4   16 n[101-104]

Note that 16 processors are allocated instead of 12, but this is correct
because Slurm is configured with the CR_SOCKET consumable resource, so whole
sockets are allocated (each node has 1 socket with 4 cores/socket). The srun
output is as expected:

sila@master2-l422:~> srun hostname
n101
n102
n103
n104
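
For reference, the 16 simply comes from rounding each task up to a whole
socket; a minimal sketch of that arithmetic (again hand-written, not Slurm
code):

#include <stdio.h>

int main(void)
{
	int cores_per_socket = 4, ntasks = 4, cpus_per_task = 3;

	/* CR_SOCKET allocates whole sockets: round each task up to a socket */
	int sockets_per_task = (cpus_per_task + cores_per_socket - 1) /
			       cores_per_socket;                       /* = 1  */
	int cpus_per_task_alloc = sockets_per_task * cores_per_socket;  /* = 4  */

	printf("allocated CPUs = %d\n", ntasks * cpus_per_task_alloc);  /* 16 */
	return 0;
}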

3) Conclusion and fix:

Since --tasks-per-node should not be required to obtain the right behaviour,
I did some investigating.

I think a bug was introduced when the allocation code for CR_SOCKET and
CR_CORE was unified (commit 6fa3d5ad), in Step 3 of the _allocate_sc()
function (src/plugins/select/cons_res/job_test.c), when computing avail_cpus:

src/plugins/select/cons_res/job_test.c, _allocate_sc(...):

	if (cpus_per_task < 2) {
		avail_cpus = num_tasks;
	} else if ((ntasks_per_core == 1) &&
		   (cpus_per_task > threads_per_core)) {
		/* find out how many cores a task will use */
		int task_cores = (cpus_per_task + threads_per_core - 1) /
				 threads_per_core;
		int task_cpus = task_cores * threads_per_core;
		/* find out how many tasks can fit on a node */
		int tasks = avail_cpus / task_cpus;
		/* how many cpus the job would use on the node */
		avail_cpus = tasks * task_cpus;
		/* subtract out the extra cpus. */
		avail_cpus -= (tasks * (task_cpus - cpus_per_task));
	} else {
		j = avail_cpus / cpus_per_task;
		num_tasks = MIN(num_tasks, j);
		if (job_ptr->details->ntasks_per_node)        <- problem
			avail_cpus = num_tasks * cpus_per_task;
	}

The 'if (job_ptr->details->ntasks_per_node)' condition marked above as
'problem' prevents avail_cpus from being correctly computed when
--ntasks-per-node is NOT specified (and cpus_per_task > 1). Before
_allocate_sockets() and _allocate_cores() were unified into _allocate_sc(),
this condition was only present in the _allocate_sockets() code. It appears
to be unnecessary in _allocate_sc(), and avail_cpus should be computed
unconditionally.
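
To illustrate with the numbers from the CR_CORE test above (a node with 4
idle CPUs, -c3, --ntasks-per-node unset), here is a small standalone
reduction of that else branch, buggy vs. patched; it is only a hand-written
model of the code quoted above, not the actual Slurm function:

#include <stdio.h>

#define MIN(a, b) (((a) < (b)) ? (a) : (b))

/* reduced model of the last else branch of _allocate_sc() */
static int branch(int avail_cpus, int num_tasks, int cpus_per_task,
		  int ntasks_per_node, int patched)
{
	int j = avail_cpus / cpus_per_task;

	num_tasks = MIN(num_tasks, j);
	if (patched || ntasks_per_node)  /* the patch drops the condition */
		avail_cpus = num_tasks * cpus_per_task;
	return avail_cpus;
}

int main(void)
{
	/* one idle quad-core node, tasks of 3 CPUs, --ntasks-per-node = 0 */
	printf("buggy  : avail_cpus = %d\n", branch(4, 4, 3, 0, 0)); /* 4 */
	printf("patched: avail_cpus = %d\n", branch(4, 4, 3, 0, 1)); /* 3 */
	return 0;
}

With the condition in place the node keeps reporting 4 usable CPUs, which is
not a multiple of cpus_per_task, and that is presumably how the 3-node /
10-CPU allocation above comes about; with the patch it reports 3 CPUs, i.e.
one task per node.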
<p><font face="Adobe Helvetica">With the following patch, the slurm
controller with select/cons_res (CR_CORE and CR_SOCKET) is doing
his job correctly when allocating with --cpus-per-task (-c)
without the need of specifying the --ntasks_per_node option:<br>
<br>

diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 25e0b8875b..4e704e8b65 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -474,8 +474,7 @@ static uint16_t _allocate_sc(struct job_record *job_ptr, bitstr_t *core_map,
 	} else {
 		j = avail_cpus / cpus_per_task;
 		num_tasks = MIN(num_tasks, j);
-		if (job_ptr->details->ntasks_per_node)
-			avail_cpus = num_tasks * cpus_per_task;
+		avail_cpus = num_tasks * cpus_per_task;
 	}
 
 	if ((job_ptr->details->ntasks_per_node &&

Test results are OK after applying the patch:

1) CR_CORE:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE
> salloc -n4 -c3
salloc: Granted job allocation 234
> squeue
 JOBID     QOS PRIORITY PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
   234  normal       43       any bash sila  R 0:02     4   12 n[101-104]

2) CR_SOCKET:
> scontrol show conf|grep -i select
SelectType              = select/cons_res
SelectTypeParameters    = CR_SOCKET
> salloc -n4 -c3
salloc: Granted job allocation 233
> squeue
 JOBID     QOS PRIORITY PARTITION NAME USER ST TIME NODES CPUS NODELIST(REASON)
   233  normal       43       any bash sila  R 0:03     4   16 n[101-104]
<p><font face="Adobe Helvetica">What do you think?<br>
</font></p>
<p><font face="Adobe Helvetica">Best regards,</font></p>
<p><font face="Adobe Helvetica">Didier</font><br>
</p>