[slurm-users] Meaning of --cpus-per-task and --mem-per-cpu when SMT processors are used

Marcus Wagner wagner at itc.rwth-aachen.de
Thu Mar 5 15:14:15 UTC 2020


Hi Alexander,

could you please do a

scontrol show config | grep SelectTypeParameters

and tell us the result?
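
The value matters because it tells us which resource (CPU, core or socket) the scheduler treats as the consumable unit. If the full config dump is too noisy, something like the following sketch isolates just the value. The helper name select_type_params is made up for illustration, and the sample input line is invented, not taken from your cluster.

```shell
# Hypothetical helper: read `scontrol show config` output on stdin and
# print only the value of the SelectTypeParameters line.
select_type_params() {
    awk -F'=' '/^SelectTypeParameters/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim whitespace around the value
        print $2
    }'
}

# On a real cluster this would be: scontrol show config | select_type_params
# Illustrative input only (assumed format "Key = Value"):
printf 'SelectType = select/cons_res\nSelectTypeParameters = CR_CORE_MEMORY\n' \
    | select_type_params
```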


In fact, for SLURM a CPU is always a CPU, regardless of whether that
means a thread (with HT) or a core (without HT).
The real question is why SLURM thinks such a node is not available.
We sometimes see this phenomenon as well, and we have to restart the
Slurm controller (slurmctld) to resolve it.
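
As a quick sanity check of the numbers in the node definition quoted below, here is a small sketch of the arithmetic. Only the four input values come from that node line; the variable names are mine. It shows the upper bounds implied by the node definition, not what the scheduler will actually accept for a given request.

```shell
# Values from the quoted node definition (Sockets, CoresPerSocket,
# ThreadsPerCore, RealMemory); variable names are hypothetical.
sockets=2
cores_per_socket=22
threads_per_core=4
real_memory_mb=254000   # RealMemory is specified in megabytes

physical_cores=$((sockets * cores_per_socket))
cpu_tot=$((physical_cores * threads_per_core))
mem_per_cpu_mb=$((real_memory_mb / cpu_tot))   # integer division, rounds down

echo "physical cores:     $physical_cores"
echo "CPUTot (threads):   $cpu_tot"
echo "RealMemory/CPUTot:  $mem_per_cpu_mb MB"
```

So a request with --cpus-per-task=50 is above the 44 physical cores but well below the 176 threads, which is exactly the range where the two readings of "CPU" disagree.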

But first I would like to see what

sbatch -vvv jobscript

outputs. I'm not sure the output will be meaningful if the job does not
get submitted, but it is worth a try.


Best
Marcus


On 3/4/20 1:25 PM, Alexander Grund wrote:
> > What is your hardware configuration?  Do you have 1 server with 44
> > processor sockets, and each processor has 4 CPU cores?  Or is it maybe
> > 1 server with 1 or more sockets for a total of 44 CPU cores, and each
> > CPU core is running 4 hyperthreads?
>
> 1 server, 2 sockets, 22 cores each, 4 hyperthreads --> 2*22*4=176 
> "CPUTot" as reported by "scontrol show node"
>
> > I think you should give the relevant node and partition lines from
> > your slurm.conf.
>
> I found the following in node.conf: NodeName=taurusml[1-32] Feature=IB 
> Gres=gpu:6 Procs=176 Sockets=2 CoresPerSocket=22 ThreadsPerCore=4 
> RealMemory=254000 State=UNKNOWN Weight=128
>
> > Which Slurm version do you run?
>
> 19.05.5
>
> > The whypending tool does not appear in a google search. Where did
> > you get it from and what does it do?
>
> It seems to be a Python script showing why a job is pending. It uses 
> pyslurm. I thought it was a slurm tool, but might be some custom thing
>
> > > Most importantly: Does this mean `--cpus-per-task` can be as high
> > > as 176 on this node and `--mem-per-cpu` can be up to the reported
> > > "RealMemory"/176?
> > Yes.
>
> > This is just historical as far as I can tell. I think 'CPU' almost
> > always means 'core'.
>
> I just tried a very simple example with 1 task and 
> `--cpus-per-task=50` (slightly higher than the 44 physical cores) and 
> it failed with "Requested node configuration is not available"
>
>
> So in summary: "CPU" for the srun/sbatch/salloc means "(physical) 
> core". "CPU" as for scontrol (and pyslurm which seems to wrap this) 
> means "Thread". This is confusing but at least the question seems to 
> be answered now.
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de



