Hello,
We have upgraded our cluster to Slurm 23.11.1 and then, a few weeks later, to 23.11.4. Since then, Slurm no longer detects hyperthreaded CPUs. We have downgraded our test cluster, and the issue is not present with Slurm 22.05 (we had skipped Slurm 23.02).
For example, we are working with this node:
$ slurmd -C
NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128215
It is defined like this in slurm.conf:
SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup,task/affinity
NodeName=node03 CPUs=40 RealMemory=150000 Feature=htc MemSpecLimit=5000
NodeSet=htc Feature=htc
PartitionName=htc Default=YES MinNodes=0 MaxNodes=1 Nodes=htc DefMemPerCPU=1000 State=UP LLN=Yes MaxMemPerNode=142000
So no oversubscribing, and 20 cores / 40 CPUs thanks to hyperthreading (2 sockets x 10 cores x 2 threads). Until the upgrade, Slurm was allocating all 40 CPUs: when launching 40 jobs of 1 CPU each, each of those jobs would use its own CPU. This is the expected behavior.
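In case it is useful for debugging, the topology the controller has registered for the node can be cross-checked against the slurmd -C output above with something like (the fields of interest being Sockets, CoresPerSocket, ThreadsPerCore and CPUTot):

$ scontrol show node node03 | grep -E 'Sockets|CoresPerSocket|ThreadsPerCore|CPUTot'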
Since the upgrade, we can still launch those 40 jobs, but only the first half of the CPUs are used (CPUs 0 to 19 according to htop). Each of those CPUs is used by 2 jobs, while the second half of the CPUs (#20 to 39) stays completely idle. When launching 40 stress processes directly on the node, without going through Slurm, all the CPUs are used.
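To make the doubling-up concrete, the binding of a running job can be inspected like this (the job ID and PID below are placeholders, not actual values from our cluster):

$ scontrol -d show job <jobid> | grep CPU_IDs
$ taskset -cp <pid of the job's process>

The first command shows the CPU IDs Slurm allocated to the step, the second the affinity mask the task actually runs with.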
When allocating a specific CPU with srun, it works up to CPU #19, and then an error occurs even though the allocation includes all the CPUs of the node:
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40

# Works for 0 to 19
srun --cpu-bind=v,map_cpu:19 stress.py
# Doesn't work (20 to 39)
srun --cpu-bind=v,map_cpu:20 stress.py
# Output:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000FFFFF.
srun: error: Task launch for StepId=57194.0 failed on node node03: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
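For what it is worth, the allocation mask in that error decodes to exactly the first 20 logical CPUs; a quick way to check it in a bash shell:

$ mask=0x00000FFFFF; for c in $(seq 0 39); do (( (mask >> c) & 1 )) && printf '%s ' "$c"; done; echo
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

With the usual Linux numbering, where the sibling hyperthreads of this topology get IDs 20 to 39, that means only one thread per core ends up in the allocation.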
This behaviour affects all our nodes, some of which have been restarted recently and others not. It causes the jobs to be frequently interrupted, increasing the gap between real time and user+system time and making the jobs slower. We have been poring over the documentation but, from what we understand, our configuration seems correct. In particular, as advised by the documentation [1], we do not set ThreadsPerCore in slurm.conf.
Are we missing something, or is there a regression or a configuration change in Slurm since version 23.11?
Thank you,
Guillaume
[1] https://slurm.schedmd.com/slurm.conf.html#OPT_ThreadsPerCore
Hi,

Please review the oversubscribe settings for CPU cores in slurm.conf, and check whether any jobs request oversubscription in sbatch.

Also, I don't know if it is still true, but try deleting Boards=1 from the node definition; it used to mess up the math.

Doug
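P.S. To be concrete about the first point: the knobs to check would be the OverSubscribe parameter on the PartitionName line in slurm.conf (the default is NO) and the -s / --oversubscribe option that users may be passing to sbatch or srun. For example (illustrative line only, not your actual configuration):

PartitionName=htc Default=YES Nodes=htc OverSubscribe=NO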