[slurm-users] problems using all cores (MPI) / cgroups / tasks problem
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Thu Jan 14 15:35:29 UTC 2021
Hello All,
I've recently upgraded one of my testing systems to Slurm 20.11.2.
I seem to have a problem that I can't figure out - to me it looks as if
it's related to cgroups, tasks and task affinity/binding.
What I'm seeing is this, in a nutshell:
[arc-login single]$ srun -M arc -p short --exclusive --pty /bin/bash
srun: job 67109169 queued and waiting for resources
srun: job 67109169 has been allocated resources
[arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
Which is what I'd expect on a 48-core node.
However, I'd expect asking for 48 tasks to get me more or less the same:
[arc-login single]$ srun -M arc -p short -n 48 --pty /bin/bash
srun: job 67109170 queued and waiting for resources
srun: job 67109170 has been allocated resources
[ouit0622@arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
The SLURM environment has SLURM_TASKS_PER_NODE=48 (and SLURM_CPU_BIND_LIST
seems to list 48 CPUs), but the numactl output looks as if I'm restricted
to a single CPU (or am I reading that wrong)?
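If it helps, I can run some cross-checks inside such an allocation and report back; I was thinking of something along these lines (the cgroup path is a guess based on the usual cgroup v1 cpuset layout, so it may need adjusting):

grep Cpus_allowed_list /proc/self/status
taskset -cp $$
cat /sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus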
Some more tinkering:
[arc-login single]$ srun -M arc -p short -n 1 --cpus-per-task 48 --pty
/bin/bash
srun: job 67109171 queued and waiting for resources
srun: job 67109171 has been allocated resources
[ouit0622@arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
[arc-login single]$ srun -M arc -p short -n 2 --cpus-per-task 24 --pty
/bin/bash
srun: job 67109172 queued and waiting for resources
srun: job 67109172 has been allocated resources
[arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0
nodebind: 0
membind: 0 1
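I can also pull the detailed per-node CPU assignment for any of these jobs if that's useful, e.g. (job ID from the last example):

scontrol -M arc show job 67109172 -d | grep -i cpu_ids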
The config (well, the task-related bits) has:
[arc-c001 single]$ scontrol -M arc show config | grep -i task
MaxTasksPerNode = 512
TaskEpilog = (null)
TaskPlugin = task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TaskAffinity = no
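I haven't pasted cgroup.conf here, but I can send it if useful; the task-relevant bits would typically be along these lines (illustrative, not necessarily my exact values):

# cgroup.conf - task/cgroup related settings (illustrative)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes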
Additionally (that's how I started investigating), MPI jobs don't seem
to work properly - they only start one task on remote nodes. Simple
tests ('mpirun env' vs 'srun env') show discrepancies in things like
SLURM_CPU_BIND_LIST and SLURM_GTIDS (I can give more detail if needed).
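To be concrete, the comparison is roughly the following (two nodes just as an example; the grep pattern covers the variables where I see the differences):

salloc -M arc -p short -N 2 --ntasks-per-node=48
srun env | grep -E 'SLURM_GTIDS|CPU_BIND' | sort | uniq -c
mpirun env | grep -E 'SLURM_GTIDS|CPU_BIND' | sort | uniq -c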
I've seen https://bugs.schedmd.com/show_bug.cgi?id=10548, which sounds
similar to what I'm seeing, but there doesn't seem to be much of a
conclusion there.
Any ideas what this is, where it's going wrong, or how to fix it? I'm not
just misunderstanding the 'srun'/'sbatch' options entirely, am I?
I'm pretty sure things behaved as expected before the SLURM upgrade - we
came from 18.08.3 if I'm not mistaken. I build RPMs from the tarball;
one possibility that crossed my mind is that I might have missed a compile
option or compile dependency (but I'm not sure which it would be if that
were the case - it's not as if the binding doesn't work at all).
In short - I'm a bit stumped; any help welcome!
Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk