[slurm-users] select/cons_res - found bug when allocating job with --cpus-per-task (-c) option on slurm 17.11.9 (fix included).
Didier GAZEN
didier.gazen at aero.obs-mip.fr
Wed Sep 5 08:33:13 MDT 2018
Hi,
On a cluster composed of quad-processor nodes, I encountered the
following issue when allocating a job for an application with 4
tasks, each requiring 3 processors (this is exactly the example given
in the --cpus-per-task section of the salloc man page).
Here are the facts:
First, the cluster configuration: 4 nodes, 1 socket/node, 4
cores/socket, 1 thread/core (a hypothetical slurm.conf sketch of this
layout follows the sinfo output below).
> sinfo -V
slurm 17.11.9-2
> sinfo
PARTITION AVAIL TIMELIMIT NODES CPUS(A/I/O/T) STATE NODELIST
any* up 2:00:00 4 0/16/0/16 idle~ n[101-104]
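For reference, here is a hypothetical slurm.conf fragment consistent with
this layout (node and partition names are taken from the sinfo output
above; the remaining keywords are assumptions, not the actual file):

# hypothetical reconstruction, not the actual configuration file
SelectType=select/cons_res
# CR_Core for section 1) below; CR_Socket for section 2)
SelectTypeParameters=CR_Core
NodeName=n[101-104] Sockets=1 CoresPerSocket=4 ThreadsPerCore=1
PartitionName=any Nodes=n[101-104] Default=YES MaxTime=02:00:00 State=UP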
1) With the CR_CORE consumable resource:
> scontrol show conf|grep -i select
SelectType = select/cons_res
SelectTypeParameters = CR_CORE
> salloc -n4 -c3
salloc: Granted job allocation 218
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
218    normal  43        any        bash  sila  R   0:03  3      10    n[101-103]
> srun hostname
srun: error: Unable to create step for job 218: More processors requested than permitted
We can see that the number of granted processors and nodes is completely
wrong: 10 CPUs instead of 12, and only 3 nodes instead of 4.
The correct behaviour when requesting 4 tasks (-n 4) with 3 processors
per task (-c 3) on a cluster of quad-core nodes is for the controller
to grant an allocation of 4 nodes, one for each of the 4 tasks.
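To make the expected numbers explicit, here is a minimal sketch of the
arithmetic (plain C, not Slurm code; the values come from the -n4 -c3
request and the 4-core nodes above):

#include <stdio.h>

int main(void)
{
    int cores_per_node = 4;   /* 1 socket x 4 cores x 1 thread */
    int ntasks         = 4;   /* salloc -n4 */
    int cpus_per_task  = 3;   /* salloc -c3 */

    /* only one 3-CPU task fits on a 4-core node */
    int tasks_per_node = cores_per_node / cpus_per_task;                 /* = 1 */
    /* so 4 tasks need 4 nodes, and the job uses 4 x 3 = 12 CPUs */
    int nodes_needed   = (ntasks + tasks_per_node - 1) / tasks_per_node; /* = 4 */
    int cpus_granted   = ntasks * cpus_per_task;                         /* = 12 */

    printf("tasks/node=%d nodes=%d cpus=%d\n",
           tasks_per_node, nodes_needed, cpus_granted);
    return 0;
}

Running it prints tasks/node=1 nodes=4 cpus=12, which matches what
forcing --tasks-per-node=1 produces below.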
Note that when specifying --tasks-per-node=1, the behaviour is correct:
> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 221
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
221    normal  43        any        bash  sila  R   0:03  4      12    n[101-104]
> srun hostname
n101
n103
n102
n104
2) With the CR_SOCKET consumable resource:
> scontrol show conf|grep -i select
SelectType = select/cons_res
SelectTypeParameters = CR_SOCKET
> salloc -n4 -c3
salloc: Granted job allocation 226
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
226    normal  43        any        bash  sila  R   0:02  3      12    n[101-103]
Here, Slurm allocates the right number of processors (12), but the number
of allocated nodes is wrong: 3 instead of 4. As a result, 2 tasks end up
on the same node (n101):
> srun hostname
n102
n101
n101
n103
Again, when specifying --tasks-per-node=1, the behaviour is correct:
> salloc -n4 -c3 --tasks-per-node=1
salloc: Granted job allocation 230
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
230    normal  43        any        bash  sila  R   0:03  4      16    n[101-104]
Note that 16 processors have been allocated instead of 12, but this is
correct because Slurm is configured with the CR_Socket consumable
resource and each node has 1 socket with 4 cores/socket (see the
arithmetic sketch after the output below). The srun output is as expected:
sila@master2-l422:~> srun hostname
n101
n102
n103
n104
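For completeness, a small sketch of the CR_Socket rounding that explains
the 16 CPUs (again plain C, not Slurm code; values taken from the
configuration above):

#include <stdio.h>

int main(void)
{
    int ntasks = 4, cpus_per_task = 3;
    int sockets_per_node = 1, cores_per_socket = 4;

    /* under CR_Socket a task is charged whole sockets, so a 3-CPU task
     * rounds up to one full 4-core socket */
    int sockets_per_task = (cpus_per_task + cores_per_socket - 1) /
                           cores_per_socket;                        /* = 1 */
    int cpus_per_alloc_task = sockets_per_task * cores_per_socket;  /* = 4 */

    printf("nodes=%d cpus=%d\n",
           ntasks * sockets_per_task / sockets_per_node,  /* = 4 */
           ntasks * cpus_per_alloc_task);                 /* = 16 */
    return 0;
}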
3) Conclusion and fix:
Since --tasks-per-node should not be required to obtain the correct
behaviour, I did some investigating.
I think a bug was introduced when the allocation code for CR_Socket and
CR_Core was unified (commit 6fa3d5ad), in Step 3 of the _allocate_sc()
function (src/plugins/select/cons_res/job_test.c), where avail_cpus is
computed:
src/plugins/select/cons_res/job_test.c, _allocate_sc(...):
if (cpus_per_task < 2) {
        avail_cpus = num_tasks;
} else if ((ntasks_per_core == 1) &&
           (cpus_per_task > threads_per_core)) {
        /* find out how many cores a task will use */
        int task_cores = (cpus_per_task + threads_per_core - 1) /
                         threads_per_core;
        int task_cpus = task_cores * threads_per_core;
        /* find out how many tasks can fit on a node */
        int tasks = avail_cpus / task_cpus;
        /* how many cpus the job would use on the node */
        avail_cpus = tasks * task_cpus;
        /* subtract out the extra cpus. */
        avail_cpus -= (tasks * (task_cpus - cpus_per_task));
} else {
        j = avail_cpus / cpus_per_task;
        num_tasks = MIN(num_tasks, j);
        if (job_ptr->details->ntasks_per_node)          /* <- problem */
                avail_cpus = num_tasks * cpus_per_task;
}
The 'if (job_ptr->details->ntasks_per_node)' condition marked above as
'problem' prevents avail_cpus from being computed correctly when
--ntasks-per-node is NOT specified (and cpus_per_task > 1). Before the
_allocate_sockets() and _allocate_cores() functions were unified into
_allocate_sc(), this condition was only present in the
_allocate_sockets() code. It appears to be unnecessary in
_allocate_sc(), and avail_cpus should be computed unconditionally.
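To illustrate, here is a standalone sketch of that else branch (plain C
with the relevant values inlined, not the real Slurm data structures),
comparing the current code with the proposed fix for one idle 4-core
node and the -n4 -c3 request:

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* ntasks_per_node is 0 when --ntasks-per-node is not given */
static int else_branch(int avail_cpus, int num_tasks, int cpus_per_task,
                       int ntasks_per_node, int apply_fix)
{
    int j = avail_cpus / cpus_per_task;   /* 4 / 3 = 1 task fits on the node */
    num_tasks = MIN(num_tasks, j);        /* = 1 */
    if (apply_fix || ntasks_per_node)
        avail_cpus = num_tasks * cpus_per_task;   /* = 3 CPUs actually usable */
    return avail_cpus;
}

int main(void)
{
    /* without the fix and without --ntasks-per-node, the node still
     * reports 4 usable CPUs instead of 3, so the controller selects
     * fewer nodes than the 4 tasks actually need */
    printf("current: %d\n", else_branch(4, 4, 3, 0, 0)); /* prints 4 */
    printf("patched: %d\n", else_branch(4, 4, 3, 0, 1)); /* prints 3 */
    return 0;
}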
With the following patch, the Slurm controller with select/cons_res
(CR_CORE and CR_SOCKET) does its job correctly when allocating with
--cpus-per-task (-c), without needing to specify the --ntasks-per-node
option:
diff --git a/src/plugins/select/cons_res/job_test.c b/src/plugins/select/cons_res/job_test.c
index 25e0b8875b..4e704e8b65 100644
--- a/src/plugins/select/cons_res/job_test.c
+++ b/src/plugins/select/cons_res/job_test.c
@@ -474,8 +474,7 @@ static uint16_t _allocate_sc(struct job_record *job_ptr, bitstr_t *core_map,
         } else {
                 j = avail_cpus / cpus_per_task;
                 num_tasks = MIN(num_tasks, j);
-                if (job_ptr->details->ntasks_per_node)
-                        avail_cpus = num_tasks * cpus_per_task;
+                avail_cpus = num_tasks * cpus_per_task;
         }
 
         if ((job_ptr->details->ntasks_per_node &&
Test results OK after applying the patch:
1) CR_CORE:
> scontrol show conf|grep -i select
SelectType = select/cons_res
SelectTypeParameters = CR_CORE
> salloc -n4 -c3
salloc: Granted job allocation 234
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
234    normal  43        any        bash  sila  R   0:02  4      12    n[101-104]
2) CR_SOCKET:
> scontrol show conf|grep -i select
SelectType = select/cons_res
SelectTypeParameters = CR_SOCKET
> salloc -n4 -c3
salloc: Granted job allocation 233
> squeue
JOBID  QOS     PRIORITY  PARTITION  NAME  USER  ST  TIME  NODES  CPUS  NODELIST(REASON)
233    normal  43        any        bash  sila  R   0:03  4      16    n[101-104]
What do you think?
Best regards,
Didier