[slurm-users] problems using all cores (MPI) / cgroups / tasks problem
Tina Friedrich
tina.friedrich at it.ox.ac.uk
Thu Jan 14 15:35:29 UTC 2021
Hello All,
I've recently upgraded one of my testing systems to Slurm 20.11.2.
I seem to have a problem that I can't figure out - to me it looks as if
it's related to cgroups, tasks and task affinity/binding.
What I'm seeing is this, in a nutshell:
[arc-login single]$ srun -M arc -p short --exclusive --pty /bin/bash
srun: job 67109169 queued and waiting for resources
srun: job 67109169 has been allocated resources
[arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
Which is what I'd expect on a 48-core node.
However, I'd expect asking for 48 tasks to get me more or less the same:
[arc-login single]$ srun -M arc -p short -n 48 --pty /bin/bash
srun: job 67109170 queued and waiting for resources
srun: job 67109170 has been allocated resources
[ouit0622@arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1
The SLURM environment has SLURM_TASKS_PER_NODE=48 (and SLURM_CPU_BIND_LIST
seems to list 48 CPUs), but the numactl output looks as if I'm restricted
to a single CPU (or am I reading that wrong)?
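If it helps, I can run some cross-checks inside such an allocation and report back; I was thinking of something along these lines (the cgroup path is a guess based on the usual cgroup v1 cpuset layout, so it may need adjusting):

grep Cpus_allowed_list /proc/self/status
taskset -cp $$
cat /sys/fs/cgroup/cpuset/slurm/uid_$(id -u)/job_${SLURM_JOB_ID}/cpuset.cpus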
Some more tinkering:
[arc-login single]$ srun -M arc -p short -n 1 --cpus-per-task 48 --pty
/bin/bash
srun: job 67109171 queued and waiting for resources
srun: job 67109171 has been allocated resources
[ouit0622@arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
cpubind: 0 1
nodebind: 0 1
membind: 0 1
[arc-login single]$ srun -M arc -p short -n 2 --cpus-per-task 24 --pty
/bin/bash
srun: job 67109172 queued and waiting for resources
srun: job 67109172 has been allocated resources
[arc-c001 single]$ numactl --show
policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
cpubind: 0
nodebind: 0
membind: 0 1
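I can also pull the detailed per-node CPU assignment for any of these jobs if that's useful, e.g. (job ID from the last example):

scontrol -M arc show job 67109172 -d | grep -i cpu_ids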
The config (well, the task-related bits) has:
[arc-c001 single]$ scontrol -M arc show config | grep -i task
MaxTasksPerNode = 512
TaskEpilog = (null)
TaskPlugin = task/affinity,task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TaskAffinity = no
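I haven't pasted cgroup.conf here, but I can send it if useful; the task-relevant bits would typically be along these lines (illustrative, not necessarily my exact values):

# cgroup.conf - task/cgroup related settings (illustrative)
CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes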
Additionally (that's how I started investigating), MPI jobs don't seem
to work properly - they only start one task on remote nodes. Simple
tests ('mpirun env' vs 'srun env') show discrepancies in things like
SLURM_CPU_BIND_LIST and SLURM_GTIDS (I can give more detail if needed).
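To be concrete, the comparison is roughly the following (two nodes just as an example; the grep pattern covers the variables where I see the differences):

salloc -M arc -p short -N 2 --ntasks-per-node=48
srun env | grep -E 'SLURM_GTIDS|CPU_BIND' | sort | uniq -c
mpirun env | grep -E 'SLURM_GTIDS|CPU_BIND' | sort | uniq -c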
I've seen https://bugs.schedmd.com/show_bug.cgi?id=10548, which sounds
similar to what I'm seeing, but there doesn't seem to be much of a
conclusion there.
Any ideas what this is, where it's going wrong, or how to fix it? I'm not
just misunderstanding the 'srun'/'sbatch' options entirely, am I?
I'm pretty sure things behaved as expected before the SLURM upgrade - we
came from 18.08.3 if I'm not mistaken. I build RPMs from the tarball;
one possibility that crossed my mind is that I might have missed a compile
option or compile dependency (but I'm not sure which it would be if that
were the case - it's not as if the binding doesn't work at all).
In short - I'm a bit stumped; any help welcome!
Tina
--
Tina Friedrich, Advanced Research Computing Snr HPC Systems Administrator
Research Computing and Support Services
IT Services, University of Oxford
http://www.arc.ox.ac.uk http://www.it.ox.ac.uk