<div dir="ltr">Hello all,<br><div><br></div><div>Thanks for the useful observations. Here is some further env vars:</div><div><br></div><div># non problematic case </div><div>$ srun -c 3 --partition=gpu-2080ti env<br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=4<br>SLURM_NTASKS=1<br>SLURM_NPROCS=1<br>SLURM_CPUS_PER_TASK=3<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=1<br>SLURM_STEP_TASKS_PER_NODE=1<br>SLURM_CPUS_ON_NODE=4<br>SLURM_NODEID=0<br><b>SLURM_PROCID=0<br>SLURM_LOCALID=0<br>SLURM_GTIDS=0</b><br><br><br># problematic case - prints two sets of env vars<br>$ srun -c 1 --partition=gpu-2080ti env<br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=2<br>SLURM_NTASKS=2<br>SLURM_NPROCS=2<br>SLURM_CPUS_PER_TASK=1<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=2<br>SLURM_STEP_TASKS_PER_NODE=2<br>SLURM_CPUS_ON_NODE=2<br>SLURM_NODEID=0<br><b>SLURM_PROCID=0<br>SLURM_LOCALID=0</b><br><b>SLURM_GTIDS=0,1<br></b><br><br>SRUN_DEBUG=3<br>SLURM_JOB_CPUS_PER_NODE=2<br>SLURM_NTASKS=2<br>SLURM_NPROCS=2<br>SLURM_CPUS_PER_TASK=1<br>SLURM_STEP_ID=0<br>SLURM_STEPID=0<br>SLURM_NNODES=1<br>SLURM_JOB_NUM_NODES=1<br>SLURM_STEP_NUM_NODES=1<br>SLURM_STEP_NUM_TASKS=2<br>SLURM_STEP_TASKS_PER_NODE=2<br>SLURM_CPUS_ON_NODE=2<br>SLURM_NODEID=0<br><b>SLURM_PROCID=1<br>SLURM_LOCALID=1<br>SLURM_GTIDS=0,1<br></b><br>Please see the ones in bold. @Hermann Schwärzler how do you plan to manage this bug? We have currently set SLURM_NTASKS_PER_NODE=1 clusterwide.<br><br>Best,</div><div>Durai<br><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Mar 25, 2022 at 12:45 PM Juergen Salk <<a href="mailto:juergen.salk@uni-ulm.de" target="_blank">juergen.salk@uni-ulm.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Bjørn-Helge,<br>

That's very similar to what we did as well, in order to avoid confusion between core, thread and CPU counts when hyperthreading is kept enabled in the BIOS.

Adding CPUs=<core_count> (rather than <thread_count>) tells Slurm to schedule physical cores only.

We have

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

and

NodeName=DEFAULT CPUs=48 Sockets=2 CoresPerSocket=24 ThreadsPerCore=2

This is for compute nodes that have 2 sockets, i.e. 2 x 24 physical cores, with hyperthreading enabled in the BIOS. (Although, in general, we do not encourage our users to make use of hyperthreading, we have decided to leave it enabled in the BIOS, as there are some corner cases that are known to benefit from it.)

With this setting Slurm also shows the total physical core count instead of the thread count and treats the --mem-per-cpu option as "--mem-per-core", which in our case is what most of our users expect.
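As a quick sanity check (the node name below is just a placeholder), scontrol should then report the configured core count rather than the thread count:

$ scontrol show node n0001 | grep -o 'CPUTot=[0-9]*'
CPUTot=48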

As to the number of tasks spawned with `--cpus-per-task=1', I think this is intended behavior. The following sentence from the srun manpage is probably relevant:

-c, --cpus-per-task=<ncpus>

If -c is specified without -n, as many tasks will be allocated per
node as possible while satisfying the -c restriction.
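
So, if a single task is what is intended, requesting it explicitly with -n should give exactly one task again, even where hyperthreads are handed out (an untested sketch):

$ srun -n 1 -c 1 echo foo | wc -l
1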

In our configuration, we allow multiple jobs of the same user to share a node (ExclusiveUser=yes), and we get

$ srun -c 1 echo foo | wc -l
1
$

However, with CPUs=<thread_count> instead of CPUs=<core_count>, I guess this would have been 2 lines of output, because the smallest unit that can be scheduled for a job is 1 physical core, which allows 2 tasks to run with hyperthreading enabled.

In case of exclusive node allocation for jobs (i.e. no node sharing allowed), Slurm would give all cores of the node to the job, which allows even more tasks to be spawned:

$ srun --exclusive -c 1 echo foo | wc -l
48
$

The 48 lines correspond exactly to the number of physical cores on the node. Again, with CPUs=<thread_count> instead of CPUs=<core_count>, I would expect 2 x 48 = 96 lines of output, but I did not test that.
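
One can also check the task count of the step directly instead of counting lines of output, e.g. (again an untested sketch; with CPUs=<thread_count> I would expect this to print 96):

$ srun --exclusive -c 1 printenv SLURM_NTASKS | sort -u
48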

Best regards
Jürgen

* Bjørn-Helge Mevik <b.h.mevik@usit.uio.no> [220325 08:49]:
> For what it's worth, we have a similar setup, with one crucial
> difference: we are handing out physical cores to jobs, not hyperthreads,
> and we are *not* seeing this behaviour:
> 
> $ srun --cpus-per-task=1 -t 10 --mem-per-cpu=1g -A nn9999k -q devel echo foo
> srun: job 5371678 queued and waiting for resources
> srun: job 5371678 has been allocated resources
> foo
> $ srun --cpus-per-task=3 -t 10 --mem-per-cpu=1g -A nn9999k -q devel echo foo
> srun: job 5371680 queued and waiting for resources
> srun: job 5371680 has been allocated resources
> foo
> 
> We have
> 
> SelectType=select/cons_tres
> SelectTypeParameters=CR_CPU_Memory
> 
> and node definitions like
> 
> NodeName=DEFAULT CPUs=40 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=182784 Gres=localscratch:330G Weight=1000
> 
> (so we set CPUs to the number of *physical cores*, not *hyperthreads*).
> 
> -- 
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
> 

-- 
Jürgen Salk
Scientific Software & Compute Services (SSCS)
Kommunikations- und Informationszentrum (kiz)
Universität Ulm
Telefon: +49 (0)731 50-22478
Telefax: +49 (0)731 50-22471