[slurm-users] Slurm on POWER9
bipcuds at gmail.com
Mon Sep 10 15:46:14 MDT 2018
We have installed slurm 17.11.8 on IBM AC922 nodes (POWER9) that have 4
GPUs each, and are running RHEL 7.5-ALT. Physically, these are 2-socket
nodes, with each socket having 20 cores. Depending on the SMT setting (SMT1,
SMT2, SMT4), there can be 40, 80, or 160 virtual "processors/CPUs".
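
As a quick check of the arithmetic above, here is a minimal Python sketch. The function name `logical_cpus` is hypothetical and used only for illustration; it simply multiplies sockets, cores per socket, and SMT level.

```python
def logical_cpus(sockets: int, cores_per_socket: int, smt: int) -> int:
    """Return the number of logical CPUs the OS exposes for a given SMT level."""
    return sockets * cores_per_socket * smt


if __name__ == "__main__":
    # AC922 as described above: 2 sockets, 20 cores per socket.
    for smt in (1, 2, 4):
        print(f"SMT{smt}: {logical_cpus(2, 20, smt)} logical CPUs")
```
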
Some problems/oddities we have seen are:
1.) Slurm seems to be incapable of recognizing sockets/cores/threads on
these systems. If I am using SMT2, for instance, this node definition is
not accepted:
NodeName=c[1-12] CoresPerSocket=20 RealMemory=583992 Sockets=2
while this line works (i.e. Slurm treats every "virtual thread/CPU" as
a single-core, single-thread socket):
NodeName=c[1-12] CoresPerSocket=1 RealMemory=583992 Sockets=80
If I am using SMT4, "Sockets=160" is accepted.
Anyone know if there is a way to get Slurm to recognize the true topology
for POWER nodes?
2.) Another concern is the gres.conf. Slurm seems to have trouble taking
processor IDs that are > "#Sockets". The true processor ID as given by
nvidia-smi topo -m output will range up to 159, and slurm doesn't like
this. Are we to use "Cores=" entries in gres.conf, and use the number of
the physical cores, instead of what nvidia-smi outputs?
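
To make the mapping in question concrete, here is a minimal Python sketch of converting logical CPU IDs to physical core indices. It rests on two assumptions that are not confirmed here: that logical CPU IDs are numbered consecutively within each core with a fixed stride of 4 hardware threads per core (so IDs run 0-159 on a 40-core node), and that gres.conf expects physical core indices. The helper names `cpu_to_core` and `cpus_to_core_range` are hypothetical.

```python
def cpu_to_core(cpu_id: int, threads_per_core: int = 4) -> int:
    """Map a logical CPU ID to a physical core index.

    ASSUMPTION: threads of one core occupy consecutive logical IDs,
    with a fixed stride of `threads_per_core` IDs per core.
    """
    return cpu_id // threads_per_core


def cpus_to_core_range(first_cpu: int, last_cpu: int, threads_per_core: int = 4) -> str:
    """Convert an inclusive logical CPU range to a 'first-last' core range string."""
    first_core = cpu_to_core(first_cpu, threads_per_core)
    last_core = cpu_to_core(last_cpu, threads_per_core)
    return f"{first_core}-{last_core}"


if __name__ == "__main__":
    # Hypothetical affinity ranges, one per socket, for illustration only.
    print(cpus_to_core_range(0, 79))    # cores on the first socket
    print(cpus_to_core_range(80, 159))  # cores on the second socket
```

Under these assumptions, logical CPUs 0-79 correspond to cores 0-19 and logical CPUs 80-159 to cores 20-39.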
3.) A related gres.conf question: there seems to be no documentation of
using "CPUs=" instead of "Cores=", yet I have seen several online examples
using "CPUs=" (and I myself have used it on an x86 system without issue).
Should one use "Cores" instead of "CPUs", when specifying binding to