[slurm-users] question about hyperthreaded CPUs, --hint=nomultithread and multi-jobstep jobs

Hans van Schoot vanschoot at scm.com
Tue May 23 09:20:18 UTC 2023

Hi all,

I am getting some unexpected behavior with SLURM on a multithreaded CPU 
(AMD Ryzen 7950X), in combination with a job that uses multiple jobsteps 
and a program that prefers to run without hyperthreading.

My job consists of a simple shell script that does multiple srun 
executions, and normally (on non-multithreaded nodes) the srun commands 
will only start when resources are available inside my allocation. Example:

sbatch -N 1 -n 16 mytestjob.sh

mytestjob.sh contains:
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
srun -n 8 someMPIprog &
# keep the batch script alive until all background jobsteps have finished
wait

srun 1 and 2 will start immediately, srun 3 will start as soon as one of 
the first two jobsteps is finished, and srun 4 will again wait until 
some cores are available.

Now I would like this same behavior (no multithreading, one task per 
core) on a node with 16 multithreaded cores (ThreadsPerCore=2, so SLURM 
sees 32 CPUs), so I submit with the following command:
sbatch -N 1 --hint=nomultithread -n 16 mytestjob.sh

Slurm correctly reserves the whole node for this, and srun without 
additional directions would launch someMPIprog with 16 MPI ranks.
Unfortunately in the multi jobstep situation, this causes all four srun 
iterations to start immediately, resulting in 4x 8 MPI ranks running at 
the same time, and thus multithreading. As I specified 
--hint=nomultithread, I would have expected the same behaviour as on the 
non-multithreaded node: srun 1 and 2 launch directly, and srun 3 and 4 
wait for CPU resources to become available.
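
To make the expectation concrete, here is a rough sketch of the arithmetic, using the node geometry from the scontrol output at the end of this message. The variable names are only illustrative, and the "observed" figure is inferred from the behaviour described above, not from anything I know about SLURM internals.

```shell
#!/bin/bash
# Node geometry as reported by scontrol for compute-7-0.
sockets=1
cores_per_socket=16
threads_per_core=2
tasks_per_step=8   # every srun in mytestjob.sh uses -n 8

cores=$(( sockets * cores_per_socket ))
cpus=$(( cores * threads_per_core ))

# Expected with --hint=nomultithread: one task per physical core,
# so concurrent jobsteps are limited by the number of cores.
expected_concurrent=$(( cores / tasks_per_step ))

# Observed: jobsteps seem to be limited by hardware threads instead.
observed_concurrent=$(( cpus / tasks_per_step ))

echo "cores=$cores cpus=$cpus"
echo "expected concurrent jobsteps: $expected_concurrent"
echo "observed concurrent jobsteps: $observed_concurrent"
```

So I expect two jobsteps at a time, but all four run at once.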

So far I've been able to find two hacky ways of getting around this problem:
- do not use --hint=nomultithread, and instead limit the jobsteps by 
memory (--mem-per-cpu=4000). This is a bit ugly: it reserves half the compute 
node and seems to bind to the wrong CPU cores.
- set --cpus-per-task=2 instead of --hint=nomultithread, but this causes 
OpenMP to kick in if the MPI program supports it.
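
A possible refinement of the second workaround, as an untested sketch: keep --cpus-per-task=2 so that each task owns a whole core, and set OMP_NUM_THREADS=1 so that OpenMP stays at one thread and the second hardware thread is left idle. This assumes someMPIprog honours OMP_NUM_THREADS. The dry-run fallback function is hypothetical and only there so the script can be tried on a machine without Slurm.

```shell
#!/bin/bash
# Assumed submission: sbatch -N 1 -n 16 --cpus-per-task=2 mytestjob.sh

# Hypothetical dry-run fallback: print the command when srun is not installed.
if ! command -v srun >/dev/null 2>&1; then
    srun() { echo "[dry run] srun $*"; }
fi

# Keep OpenMP-capable programs at one thread per MPI rank.
export OMP_NUM_THREADS=1

# Each jobstep asks for 8 tasks x 2 CPUs = 16 CPUs, so only two
# jobsteps should fit into the 32-CPU allocation at the same time.
srun -n 8 --cpus-per-task=2 someMPIprog &
srun -n 8 --cpus-per-task=2 someMPIprog &
srun -n 8 --cpus-per-task=2 someMPIprog &
srun -n 8 --cpus-per-task=2 someMPIprog &

# Wait for all background jobsteps before the batch script exits.
wait
```

Whether the one remaining thread per task ends up on the intended core still depends on the task binding configuration, so this would need checking on the node.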

To me this feels like a bit of a bug in SLURM: I tell it not to 
multithread, but it still schedules jobsteps that cause the CPU to 
multithread.

Is there another way of getting the non-multithreaded behavior without 
disabling multithreading in BIOS?

Best regards and many thanks in advance!
Hans van Schoot

Some additional information:
- I'm running slurm 18.08.4
- This is my node configuration in scontrol:
scontrol show nodes compute-7-0
NodeName=compute-7-0 Arch=x86_64 CoresPerSocket=16
    CPUAlloc=0 CPUTot=32 CPULoad=1.00
    NodeAddr= NodeHostName=compute-7-0 Version=18.08
    OS=Linux 6.1.8-1.el7.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Jan 23 12:57:27 EST 2023
    RealMemory=64051 AllocMem=0 FreeMem=62681 Sockets=1 Boards=1
    State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=20511200 Owner=N/A 
    BootTime=2023-04-04T14:14:55 SlurmdStartTime=2023-04-17T12:32:38
    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
