[slurm-users] Slurm on POWER9
Chris Samuel
chris at csamuel.org
Mon Sep 10 16:04:54 MDT 2018
Hi Keith,
On Tuesday, 11 September 2018 7:46:14 AM AEST Keith Ball wrote:
> 1.) Slurm seems to be incapable of recognizing sockets/cores/threads on
> these systems.
[...]
> Anyone know if there is a way to get Slurm to recognize the true topology
> for POWER nodes?
IIRC Slurm uses hwloc for discovering topology, so "lstopo-no-graphics" might
give you some insight into whether hwloc itself is seeing the right config.
I'd be curious to see what "lscpu" and "slurmd -C" say as well.
> 2.) Another concern is the gres.conf. Slurm seems to have trouble taking
> processor ID's that are > "#Sockets". The true processor ID as given by
> nvidia-smi topo -m output will range up to 159, and slurm doesn't like
> this. Are we to use "Cores=" entries in gres.conf, and use the number of
> the physical cores, instead of what nvidia-smi outputs?
Again I *think* Slurm is using hwloc's logical CPU numbering for this, so
lstopo should help. Here is a quick snippet from my local PC (HT enabled):
  Package L#0 + L3 L#0 (8192KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)
You can see that the logical numbering (e.g. PU L#0 and L#1 under Core L#0) is
contiguous, whereas the physical numbering (P#0 and P#4 for those same two
PUs) reflects how the firmware has enumerated the CPUs.
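If it helps, here is a rough Python sketch (mine, not anything Slurm ships)
that reads lstopo text like the snippet above and builds two lookups: logical
PU index to physical PU index, and physical PU index to logical core index.
The function name parse_lstopo and the SAMPLE string are made up for
illustration, and it assumes the "Core L#n" / "PU L#n (P#m)" line format shown
above. Whether Slurm's Cores= really follows hwloc's logical core index is
still only my guess, so treat the result as a starting point for checking.

```python
import re

# Sample text copied from the lstopo snippet in this message.
SAMPLE = """\
Package L#0 + L3 L#0 (8192KB)
  L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
    PU L#0 (P#0)
    PU L#1 (P#4)
  L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
    PU L#2 (P#1)
    PU L#3 (P#5)
  L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
    PU L#4 (P#2)
    PU L#5 (P#6)
  L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
    PU L#6 (P#3)
    PU L#7 (P#7)
"""

CORE_RE = re.compile(r"Core L#(\d+)")
PU_RE = re.compile(r"PU L#(\d+) \(P#(\d+)\)")


def parse_lstopo(text):
    """Return (logical PU -> physical PU, physical PU -> logical core)."""
    logical_to_physical = {}
    physical_to_core = {}
    current_core = None
    for line in text.splitlines():
        # A "Core L#n" entry applies to the PUs that follow it
        # (or that share its line, when a core has a single PU).
        core = CORE_RE.search(line)
        if core:
            current_core = int(core.group(1))
        pu = PU_RE.search(line)
        if pu:
            logical, physical = int(pu.group(1)), int(pu.group(2))
            logical_to_physical[logical] = physical
            if current_core is not None:
                physical_to_core[physical] = current_core
    return logical_to_physical, physical_to_core


if __name__ == "__main__":
    l2p, p2c = parse_lstopo(SAMPLE)
    print("logical PU -> physical PU:", l2p)
    print("physical PU -> logical core:", p2c)
```

With something like that you could take the physical CPU IDs reported by
"nvidia-smi topo -m" and look up which logical core each one belongs to
before writing Cores= entries.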
> 3.) A related gres.conf question: there seems to be no documentation of
> using "CPUs=" instead of "Cores=", yet I have seen several online examples
> using "CPUs=" (and I myself have used it on an x86 system without issue).
> Should one use "Cores" instead of "CPUs", when specifying binding to
> specific GPUs?
I think CPUs= was the older syntax which has been replaced with Cores=.
The gres.conf we use on our HPC cluster uses Cores= quite happily.
Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35
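For nodes with more GPUs, lines of that shape are easy to generate. This is
only a sketch: the helper name gres_lines is made up, and it assumes each GPU
sits next to an equal-sized, contiguous block of cores in /dev/nvidiaN order,
which you would need to confirm against lstopo and "nvidia-smi topo -m" on
your own hardware.

```python
def gres_lines(name, gpu_type, cores_per_gpu, n_gpus):
    """Build gres.conf lines, assuming contiguous equal core blocks per GPU."""
    lines = []
    for i in range(n_gpus):
        first = i * cores_per_gpu          # first core in this GPU's block
        last = first + cores_per_gpu - 1   # last core in this GPU's block
        lines.append(
            f"Name={name} Type={gpu_type} File=/dev/nvidia{i} "
            f"Cores={first}-{last}"
        )
    return lines


if __name__ == "__main__":
    # Reproduces the two lines from our gres.conf shown above.
    for line in gres_lines("gpu", "p100", 18, 2):
        print(line)
```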
All the best!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC