[slurm-users] srun jobfarming hassle question

Wed Jan 18 12:05:57 UTC 2023

Dear Colleagues,

For quite some years now, we have repeatedly run into issues on our clusters with so-called job-farming (or task-farming) concepts in Slurm jobs using srun. It bothers me that we can hardly help users with requests in this regard.

From the documentation (https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES), it reads as if the following

------------------------------------------->

...

#SBATCH --nodes=??

...

srun -N 1 -n 2 ... prog1 &> log.1 &

srun -N 1 -n 1 ... prog2 &> log.2 &

...

wait

------------------------------------------->

should do it, meaning that as many job steps as requested are created and placed sensibly on the resources/slots available in the job allocation.

Well, this does not really work on our clusters. (I'm afraid that I'm just too idiotic to use srun here ...)

As long as complete nodes are used with a regular tasks-per-node/cpus-per-task pattern, everything is still manageable, although task and thread placement using srun is sometimes still a burden.

But if I want freer resource specifications, as in the example below (with half a node or so), I simply fail to get the desired result.

OK. We have Haswell nodes with 2 sockets, and each socket has 2 NUMA domains with 7 cores each: 28 physical cores in total, 56 logical CPUs with Hyperthreading. The logical CPUs are numbered as follows.

socket   phys. CPU   logical CPUs
0        0           0,28
0        1           1,29
0        2           2,30
...
1        0           14,42
...
1        13          27,55
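To make the numbering explicit, here is a small sketch of the rule I read off the table above. The function name logical_cpus and the two constants are only illustrative, and the rule itself (14 cores per socket, hyperthread sibling at +28) is my reading of our nodes, not something Slurm reports in this form.

```shell
#!/bin/bash
# Assumed layout, taken from the table above:
#   2 sockets, 14 physical cores per socket (28 per node),
#   hyperthread sibling of logical CPU n is n + 28.
CORES_PER_SOCKET=14
PHYS_CORES_PER_NODE=28

# logical_cpus SOCKET CORE
# Prints the two logical CPU IDs that belong to one physical core,
# where CORE is counted from 0 within its socket.
logical_cpus() {
    local socket=$1 core=$2
    local first=$(( socket * CORES_PER_SOCKET + core ))
    echo "${first},$(( first + PHYS_CORES_PER_NODE ))"
}

# Print the first and last core of each socket.
for socket in 0 1; do
    for core in 0 13; do
        printf 'socket %d, core %2d -> CPUs %s\n' \
            "$socket" "$core" "$(logical_cpus "$socket" "$core")"
    done
done
```

The rows printed for (0,0), (1,0) and (1,13) match the corresponding rows of the table.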

(slurm.conf is attached ... essential is "cm2_inter" partition of "inter" cluster)

So, for instance, for an OpenMP-only program, I'd like to place 14 OMP threads on the 1st socket, another step with 14 OMP threads on the 2nd socket (of first node), etc.
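Spelled out for 4 steps on 2 nodes, the placement I am aiming for would look like the sketch below. It only prints the intended mapping and does not call srun at all; the helper names socket_cpus and step_target are made up for this illustration, and the CPU ranges assume the numbering of our Haswell nodes (14 cores per socket, hyperthread sibling at +28).

```shell
#!/bin/bash
# Assumed node layout: 2 sockets, 14 physical cores per socket,
# hyperthread sibling of logical CPU n is n + 28.
SOCKETS_PER_NODE=2
CORES_PER_SOCKET=14
PHYS_CORES_PER_NODE=28

# socket_cpus SOCKET
# Prints all logical CPUs of one socket as two ranges
# (physical cores first, then their hyperthread siblings).
socket_cpus() {
    local lo=$(( $1 * CORES_PER_SOCKET ))
    local hi=$(( lo + CORES_PER_SOCKET - 1 ))
    echo "${lo}-${hi},$(( lo + PHYS_CORES_PER_NODE ))-$(( hi + PHYS_CORES_PER_NODE ))"
}

# step_target STEP
# Prints "node socket" (both counted from 0) for STEP = 1, 2, ...
# Sockets of the first node are filled first, then the next node.
step_target() {
    local idx=$(( $1 - 1 ))
    echo "$(( idx / SOCKETS_PER_NODE )) $(( idx % SOCKETS_PER_NODE ))"
}

for step in 1 2 3 4; do
    read -r node socket <<< "$(step_target "$step")"
    echo "step $step: node $node, socket $socket, CPUs $(socket_cpus "$socket")"
done
```

So steps 1 and 2 should share the first node (one socket each), and steps 3 and 4 the second node.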

------------------------------------------->

#!/bin/bash
#SBATCH -J jobfarm_test
#SBATCH -o log.%x.%j.%N.out
#SBATCH -D ./
#SBATCH --mail-type=NONE
#SBATCH --time=00:10:00
#SBATCH --export=NONE
#SBATCH --get-user-env
#SBATCH --clusters=cm2
#SBATCH --partition=cm2_std
#SBATCH --nodes=2
module load slurm_setup       # specific to LRZ cluster

export OMP_NUM_THREADS=14
placement=/lrz/sys/tools/placement_test_2021/bin/placement-test.omp_only
srun_opts="-N 1 -n 1 -c $((2*$OMP_NUM_THREADS)) --mpi=none --exclusive --export=ALL --cpu_bind=verbose,cores"

for i in $(seq 1 4); do
   srun $srun_opts $placement -d 10 &> log.$i &
done
wait

------------------------------------------->

placement-test.omp_only is just an OpenMP executable in which each thread reads from /proc/... its thread ID and the CPU it is running on, prints them to the screen, and then sleeps in order to stay on the CPU in running state for longer.

With the script above, I assumed that all 4 srun steps would run at the same time, one on each socket. But they don't.

First of all, due to Hyperthreading, I must specify "-c 56" here. If I use "-c 28" (which would be more intuitive to me), the CPUs 0-6,28-32 are used (the first NUMA domain). And even if I use -c 28 or -c 14, the steps don't run at the same time on a node: only a single step runs per node at a time.

Removing "--exclusive" doesn't change anything. Setting --cpu_bind to socket doesn't have an effect either (at this point I was already taking shots in the dark).

I want to avoid more stringent requirements (such as memory), so that the steps can share the complete memory available on a node. But even reducing the memory requirement per CPU (to a ridiculous 1K) does not change anything.

So I am definitely doing something wrong, but I can't even guess what. I tried a great many options, including -m and -B, without success. The complexity is killing me here.

From the SchedMD documentation, I assume it shouldn't be so complicated that one has to use the low-level placement options described at https://slurm.schedmd.com/mc_support.html

Does anyone have a clue how to use srun for these purposes?

But if not, I would also be glad to learn about alternatives ... if they are as convenient as SchedMD promised with https://slurm.schedmd.com/srun.html#SECTION_EXAMPLES

Then I can get rid of srun ... maybe. (In my desperation, I even tried GNU parallel with SSH process spawning ... :( Conclusion: it is not really convenient for that purpose.)

Thank you very much in advance!

Kind regards,

Martin

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 5513 bytes
Desc: slurm.conf
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20230118/9f2fc73c/attachment.obj>