Hi,
Please review the OverSubscribe settings for CPU cores in slurm.conf, and check whether your jobs request oversubscription in sbatch (--oversubscribe). I don't know if it is still true, but try deleting Boards=1 from the node definition. It used to mess up the math.
Doug

On Wed, Mar 27, 2024, 7:09 AM Guillaume COCHARD via slurm-users <slurm-users@lists.schedmd.com> wrote:
Hello,

We upgraded our cluster to Slurm 23.11.1 and then, a few weeks later, to 23.11.4. Since then, Slurm doesn't detect hyperthreaded CPUs. We have downgraded our test cluster, and the issue is not present with Slurm 22.05 (we had skipped Slurm 23.02).

For example, we are working with this node:

$ slurmd -C
NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128215

It is defined like this in slurm.conf:

SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup,task/affinity
NodeName=node03 CPUs=40 RealMemory=150000 Feature=htc MemSpecLimit=5000
NodeSet=htc Feature=htc
PartitionName=htc Default=YES MinNodes=0 MaxNodes=1 Nodes=htc DefMemPerCPU=1000 State=UP LLN=Yes MaxMemPerNode=142000

So no oversubscribing: 20 cores and 40 CPUs thanks to hyperthreading. Until the upgrade, Slurm was allocating those 40 CPUs: when launching 40 jobs of 1 CPU, each of those jobs would use its own CPU. This is the expected behavior.
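
For reference, the core and CPU counts above can be checked from the slurmd -C line with a short Python sketch. The helper name parse_node_line is hypothetical and not part of Slurm; the sketch only assumes that the line is made of space-separated Key=Value tokens.

```python
# Sketch: derive core and logical-CPU counts from a `slurmd -C` line.
# parse_node_line is a hypothetical helper, not part of Slurm.

def parse_node_line(line):
    """Split the 'Key=Value' tokens of a slurmd -C line into a dict."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens without '='
            fields[key] = value
    return fields

line = ("NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 "
        "CoresPerSocket=10 ThreadsPerCore=2 RealMemory=128215")
node = parse_node_line(line)

# Physical cores = boards x sockets per board x cores per socket.
cores = (int(node["Boards"])
         * int(node["SocketsPerBoard"])
         * int(node["CoresPerSocket"]))
# Logical CPUs = physical cores x hardware threads per core.
logical_cpus = cores * int(node["ThreadsPerCore"])

print(f"physical cores: {cores}")
print(f"logical CPUs:   {logical_cpus}")
```

For node03 this gives 20 physical cores and 40 logical CPUs, matching CPUs=40 in the node definition.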

Since the upgrade, we can still launch those 40 jobs, but only the first half of the CPUs is used (CPUs 0 to 19 according to htop). Each of those CPUs is shared by 2 jobs, and the second half of the CPUs (#20 to 39) stays completely idle. When launching 40 stress processes directly on the node without using Slurm, all the CPUs are used.
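
One way to cross-check the htop observation from inside a job is to print the CPU affinity of the running process. The following is a minimal Python sketch, assuming a Linux node (os.sched_getaffinity is Linux-only); format_cpu_ranges is a hypothetical helper introduced only for readable output.

```python
import os

def format_cpu_ranges(cpus):
    """Compress a set of CPU ids into a string such as '0-19,40'."""
    ranges = []
    start = prev = None
    for cpu in sorted(cpus):
        if start is None:
            start = prev = cpu          # open the first range
        elif cpu == prev + 1:
            prev = cpu                  # extend the current range
        else:
            ranges.append((start, prev))
            start = prev = cpu          # open a new range
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

# os.sched_getaffinity(0) returns the set of CPUs the current process
# is allowed to run on (only available on Linux).
if hasattr(os, "sched_getaffinity"):
    print("allowed CPUs:", format_cpu_ranges(os.sched_getaffinity(0)))
```

Run inside a 1-CPU job, this prints the CPU ids the task is confined to, which can be compared across the 40 jobs to see whether any of them ever lands on CPUs 20 to 39.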

When binding to a specific CPU with srun, it works up to CPU #19, and then an error occurs, even though the allocation includes all the CPUs of the node:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
# Works for 0 to 19
srun --cpu-bind=v,map_cpu:19 stress.py

# Doesn't work (20 to 39)
srun --cpu-bind=v,map_cpu:20 stress.py
# Output:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x00000FFFFF.
srun: error: Task launch for StepId=57194.0 failed on node node03: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
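
The mask in the error message can be decoded to see which CPUs the step was actually given. This is a minimal Python sketch, assuming that bit i of the mask corresponds to the CPU id i used by map_cpu; cpus_in_mask is a hypothetical helper, not a Slurm tool.

```python
def cpus_in_mask(mask_hex):
    """Return the CPU ids whose bits are set in a hexadecimal CPU mask."""
    mask = int(mask_hex, 16)  # int() accepts the '0x' prefix with base 16
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]

# Mask copied from the srun error message above.
allowed = cpus_in_mask("0x00000FFFFF")

print("number of CPUs in mask:", len(allowed))        # 20
print("lowest / highest CPU:", allowed[0], allowed[-1])  # 0 19
print("CPU 20 in mask:", 20 in allowed)               # False
```

Under that assumption, the mask covers only CPUs 0 to 19, which is consistent with map_cpu:19 succeeding and map_cpu:20 failing despite --cpus-per-task=40.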

This behaviour affects all our nodes, whether or not they have been restarted recently. It causes the jobs to be frequently interrupted, which increases the gap between the real (wall-clock) time and the user+system time and makes the jobs slower. We have been poring over the documentation but, from what we understand, our configuration seems correct. In particular, as advised by the documentation[1], we don't set ThreadsPerCore in slurm.conf.

Are we missing something, or is there a regression or a change in Slurm configuration since version 23.11?

Thank you,
Guillaume

[1] : https://slurm.schedmd.com/slurm.conf.html#OPT_ThreadsPerCore
