<div dir="ltr">Hello all,<br><div><br></div><div>Thanks for the useful observations. Here are some further env vars:</div><div><br></div><div># non-problematic case </div><div>$ srun -c 3 --partition=gpu-2080ti env<br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=4<br>SLURM_NTASKS=1<br>SLURM_NPROCS=1<br>SLURM_CPUS_PER_TASK=3<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=1<br>SLURM_STEP_TASKS_PER_NODE=1<br>SLURM_CPUS_ON_NODE=4<br>SLURM_NODEID=0<br><b>SLURM_PROCID=0<br>SLURM_LOCALID=0<br>SLURM_GTIDS=0</b><br><br><br># problematic case - prints two sets of env vars<br>$ srun -c 1 --partition=gpu-2080ti env<br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=2<br>SLURM_NTASKS=2<br>SLURM_NPROCS=2<br>SLURM_CPUS_PER_TASK=1<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=2<br>SLURM_STEP_TASKS_PER_NODE=2<br>SLURM_CPUS_ON_NODE=2<br>SLURM_NODEID=0<br><b>SLURM_PROCID=0<br>SLURM_LOCALID=0</b><br><b>SLURM_GTIDS=0,1<br></b><br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=2<br>SLURM_NTASKS=2<br>SLURM_NPROCS=2<br>SLURM_CPUS_PER_TASK=1<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=2<br>SLURM_STEP_TASKS_PER_NODE=2<br>SLURM_CPUS_ON_NODE=2<br>SLURM_NODEID=0<br><b>SLURM_PROCID=1<br>SLURM_LOCALID=1<br>SLURM_GTIDS=0,1<br></b><br>Please see the ones in bold. @Hermann Schwärzler, how do you plan to manage this bug? 
For now, we have set SLURM_NTASKS_PER_NODE=1 cluster-wide.<br><br>Best,</div><div>Durai<br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 25, 2022 at 12:45 PM Juergen Salk <<a href="mailto:juergen.salk@uni-ulm.de" target="_blank">juergen.salk@uni-ulm.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Bjørn-Helge,<br>

<br>

that's very similar to what we did to avoid confusion between core,<br>

thread and CPU counts when hyperthreading is kept enabled in the<br>

BIOS. <br>

<br>

Adding CPUs=<core_count> (not <thread_count>) will tell Slurm to only <br>

schedule physical cores. <br>

<br>

We have <br>

<br>

SelectType=select/cons_res<br>

SelectTypeParameters=CR_Core_Memory<br>

<br>

and<br>

<br>

NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2 <br>

<br>

This is for compute nodes that have 2 sockets, 2 x 24 physical cores<br>

with hyperthreading enabled in the BIOS. (Although, in general, we do<br>

not encourage our users to make use of hyperthreading, we have decided<br>

to leave it enabled in the BIOS as there are some corner cases that<br>

are known to benefit from hyperthreading.)<br>

<br>
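
As a rough illustration of the counts involved, here is a minimal Python sketch (not part of Slurm) that derives the core and thread counts from the node line above. The helper names parse_node_line and core_and_thread_counts are made up for this example.<br>

```python
def parse_node_line(line):
    """Split a slurm.conf-style node line into a dict of Key=Value fields."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not Key=Value pairs
            fields[key] = value
    return fields


def core_and_thread_counts(fields):
    """Return (physical cores, hardware threads) for one node definition."""
    sockets = int(fields["Sockets"])
    cores_per_socket = int(fields["CoresPerSocket"])
    threads_per_core = int(fields["ThreadsPerCore"])
    cores = sockets * cores_per_socket
    return cores, cores * threads_per_core


node = parse_node_line(
    "NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2"
)
cores, threads = core_and_thread_counts(node)
# CPUs matches the core count (48), not the thread count (96).
print(cores, threads, int(node["CPUs"]) == cores)
```

<br>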

With this setting, Slurm also reports physical core counts rather than<br>

thread counts and treats the --mem-per-cpu option as "--mem-per-core",<br>

which in our case is what most of our users expect.<br>

<br>

As to the number of tasks spawned with `--cpus-per-task=1`, I think this <br>

is intended behavior. The following sentence from the srun manpage is<br>

probably relevant:<br>

<br>

-c, --cpus-per-task=<ncpus><br>

<br>

  If -c is specified without -n, as many tasks will be allocated per<br>

  node as possible while satisfying the -c restriction.<br>

<br>
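
Read literally, that sentence suggests a simple model: the task count is the number of CPUs allocated on the node divided by -c, rounded down. The Python sketch below is only that simplified model, not Slurm's actual logic, and the function name tasks_without_n is made up for illustration. The CPU counts used are example inputs.<br>

```python
def tasks_without_n(allocated_cpus, cpus_per_task):
    """Simplified model: as many tasks as fit into the allocated CPUs."""
    return allocated_cpus // cpus_per_task


# (allocated CPUs on the node, value passed to -c); illustrative inputs only
for allocated, per_task in [(4, 3), (2, 1), (48, 1)]:
    count = tasks_without_n(allocated, per_task)
    print(f"{allocated} CPUs with -c {per_task}: {count} task(s)")
```

Under this model, passing -n 1 explicitly would be the way to pin the task count regardless of how many CPUs end up allocated.<br>

<br>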

In our configuration, we allow multiple jobs to run for the same user<br>

on a node (ExclusiveUser=yes) and we get <br>

<br>

$ srun -c 1 echo foo | wc -l<br>

1<br>

$<br>

<br>

However, with CPUs=<thread_count> instead of CPUs=<core_count>,<br>

I would guess this yields 2 lines of output, because the smallest unit<br>

Slurm schedules for a job is 1 physical core, which allows 2 tasks to<br>

run with hyperthreading enabled. <br>

<br>
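
The guess above can be written down as arithmetic: round the requested CPUs up to whole cores, then fit as many tasks as possible into the result. The Python sketch below is a simplified model under that assumption, not Slurm code. The function name cpus_after_core_rounding is hypothetical, and CPUs=core_count is modelled here as one schedulable CPU per core.<br>

```python
import math


def cpus_after_core_rounding(requested_cpus, schedulable_cpus_per_core):
    """Round a CPU request up to whole physical cores (simplified model)."""
    cores = math.ceil(requested_cpus / schedulable_cpus_per_core)
    return cores * schedulable_cpus_per_core


# Assumption: CPUs=thread_count gives 2 schedulable CPUs per core,
# CPUs=core_count gives 1.
for label, per_core in [("CPUs=thread_count", 2), ("CPUs=core_count", 1)]:
    for requested in (1, 3):
        allocated = cpus_after_core_rounding(requested, per_core)
        tasks = allocated // requested  # as many tasks as fit without -n
        print(f"{label}, -c {requested}: {allocated} CPUs, {tasks} task(s)")
```

<br>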

In case of exclusive node allocation for jobs (i.e. no node<br>

sharing allowed), Slurm would give all cores of a node to the job, <br>

which allows even more tasks to be spawned:<br>

<br>

$ srun --exclusive -c 1 echo foo | wc -l<br>

48<br>

$<br>

<br>

48 lines correspond exactly to the number of physical cores on the<br>

node. Again, with CPUs=<thread_count> instead of CPUs=<core_count>, I<br>

would expect 2 x 48 = 96 lines of output, but I did not test that. <br>

<br>

Best regards<br>

Jürgen<br>

<br>

<br>

* Bjørn-Helge Mevik <<a href="mailto:b.h.mevik@usit.uio.no" target="_blank">b.h.mevik@usit.uio.no</a>> [220325 08:49]:<br>

> For what it's worth, we have a similar setup, with one crucial<br>

> difference: we are handing out physical cores to jobs, not hyperthreads,<br>

> and we are *not* seeing this behaviour:<br>

> <br>

> $ srun --cpus-per-task=1 -t 10 --mem-per-cpu=1g -A nn9999k -q devel echo foo<br>

> srun: job 5371678 queued and waiting for resources<br>

> srun: job 5371678 has been allocated resources<br>

> foo<br>

> $ srun --cpus-per-task=3 -t 10 --mem-per-cpu=1g -A nn9999k -q devel echo foo<br>

> srun: job 5371680 queued and waiting for resources<br>

> srun: job 5371680 has been allocated resources<br>

> foo<br>

> <br>

> We have<br>

> <br>

> SelectType=select/cons_tres<br>

> SelectTypeParameters=CR_CPU_Memory<br>

> <br>

> and node definitions like<br>

> <br>

> NodeName=DEFAULT CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=182784 Gres=localscratch:330G Weight=1000<br>

> <br>

> (so we set CPUs to the number of *physical cores*, not *hyperthreads*).<br>

> <br>

> -- <br>

> Regards,<br>

> Bjørn-Helge Mevik, dr. scient,<br>

> Department for Research Computing, University of Oslo<br>

> <br>

<br>

<br>

<br>

-- <br>

Jürgen Salk<br>

Scientific Software & Compute Services (SSCS)<br>

Kommunikations- und Informationszentrum (kiz)<br>

Universität Ulm<br>

Telefon: +49 (0)731 50-22478<br>

Telefax: +49 (0)731 50-22471<br>

<br>

</blockquote></div>