[slurm-users] Meaning of --cpus-per-task and --mem-per-cpu when SMT processors are used
Marcus Wagner
wagner at itc.rwth-aachen.de
Thu Mar 5 15:14:15 UTC 2020
Hi Alexander,
could you please run
scontrol show config | grep SelectTypeParameters
and tell us the result?
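In case it helps with reading the result: as far as I know, the CR_* entry in that value names the resource the scheduler hands out (sockets, whole cores, or individual CPUs). Below is a rough sketch of such a check. The function name describe_select_params is made up for illustration and is not a Slurm tool, the example value is hard-coded rather than taken from your cluster, and the mapping is a simplification of the slurm.conf documentation.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): report which consumable resource
# a SelectTypeParameters value names. Matching is case-insensitive because
# the spelling in slurm.conf and in scontrol output may differ.
describe_select_params() {
    local upper token
    upper=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    local IFS=','
    # The value is a comma-separated list; look at each option on its own
    # so that options merely starting with CR_CORE are not misread.
    for token in $upper; do
        case "$token" in
            CR_CORE|CR_CORE_MEMORY)     echo "core";   return ;;
            CR_CPU|CR_CPU_MEMORY)       echo "cpu";    return ;;
            CR_SOCKET|CR_SOCKET_MEMORY) echo "socket"; return ;;
        esac
    done
    echo "unknown"
}

# Example value only; on a cluster it would come from the scontrol command above.
params="CR_Core_Memory"
echo "Consumable resource for ${params}: $(describe_select_params "$params")"
```

If this reports "core" on a node with several threads per core, a job is allocated whole cores even though CPUs are counted in threads, which would be worth knowing for this discussion.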
In fact, for SLURM a CPU is always a CPU, regardless of whether that
means a thread (with HT) or a core (without HT).
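To make the counting concrete, here is the arithmetic for the node definition quoted below (Sockets=2, CoresPerSocket=22, ThreadsPerCore=4, RealMemory=254000). This is only a sketch of the numbers involved; the variable names are my own, and which of the two limits actually applies depends on the cluster configuration.

```shell
#!/bin/sh
# Values copied from the node definition quoted below (taurusml[1-32]).
sockets=2
cores_per_socket=22
threads_per_core=4
real_memory_mb=254000   # RealMemory in slurm.conf is given in megabytes

# Physical cores and hardware threads on one node.
physical_cores=$((sockets * cores_per_socket))
cpus_total=$((physical_cores * threads_per_core))

# Largest whole-number --mem-per-cpu (MB) that still fits into RealMemory,
# once if every thread counts as a CPU and once if only cores do.
mem_per_cpu_threads=$((real_memory_mb / cpus_total))
mem_per_cpu_cores=$((real_memory_mb / physical_cores))

echo "physical cores:            ${physical_cores}"       # 44
echo "CPUs (threads):            ${cpus_total}"           # 176
echo "mem-per-cpu, 176 CPUs:     ${mem_per_cpu_threads}"  # 1443
echo "mem-per-cpu, 44 CPUs:      ${mem_per_cpu_cores}"    # 5772
```

The 176 matches the "CPUTot" reported by "scontrol show node", while the failed request of 50 CPUs per task lies between the 44 cores and the 176 threads.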
The more interesting question is why SLURM thinks such a node is not available.
We also see this phenomenon from time to time and have to restart the
Slurm controller (slurmctld) to resolve it.
But first I would like to see what
sbatch -vvv jobscript
outputs. I'm not sure whether it will be meaningful if the job does
not get submitted, but it is worth a try.
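A minimal job script for that test could look like the following. It reuses the request from the report quoted below (1 task, --cpus-per-task=50); the file name "jobscript" and the script body are only placeholders, not what was actually submitted.

```shell
#!/bin/sh
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=50

# SLURM_CPUS_PER_TASK is set by Slurm inside the job when --cpus-per-task
# is given; fall back to "unset" so the script also runs outside of a job.
cpus="${SLURM_CPUS_PER_TASK:-unset}"
echo "Running on $(uname -n) with cpus-per-task=${cpus}"
```

Submitting it with the -vvv option as above should print the options sbatch derived from the script before the request is rejected or accepted.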
Best
Marcus
On 3/4/20 1:25 PM, Alexander Grund wrote:
> > What is your hardware configuration? Do you have 1 server with 44
> processor sockets, and each processor has 4 CPU cores? Or is it maybe
> 1 server with 1 or more sockets for a total of 44 CPU cores, and each
> CPU core is running 4 hyperthreads?
>
> 1 server, 2 sockets, 22 cores each, 4 hyperthreads --> 2*22*4=176
> "CPUTot" as reported by "scontrol show node"
>
> > I think you should give the relevant node and partition lines from
> your slurm.conf.
>
> I found the following in node.conf: NodeName=taurusml[1-32] Feature=IB
> Gres=gpu:6 Procs=176 Sockets=2 CoresPerSocket=22 ThreadsPerCore=4
> RealMemory=254000 State=UNKNOWN Weight=128
>
> > Which Slurm version do you run?
>
> 19.05.5
>
> > The whypending tool does not appear in a google search. Where did
> you get it from and what does it do?
>
> It seems to be a Python script showing why a job is pending. It uses
> pyslurm. I thought it was a slurm tool, but might be some custom thing
>
> > >Most importantly: Does this mean `--cpus-per-task` can be as high
> as 176 on this node and `--mem-per-cpu` can be up to the reported
> "RealMemory"/176?
> > Yes.
>
> > This is just historical as far as I can tell. I think 'CPU' almost
> always means 'core'.
>
> I just tried a very simple example with 1 task and
> `--cpus-per-task=50` (slightly higher than the 44 physical cores) and
> it failed with "Requested node configuration is not available"
>
>
> So in summary: "CPU" for the srun/sbatch/salloc means "(physical)
> core". "CPU" as for scontrol (and pyslurm which seems to wrap this)
> means "Thread". This is confusing but at least the question seems to
> be answered now.
>
--
Marcus Wagner, Dipl.-Inf.
IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de