Hi Everyone,
We have a SLURM cluster with three different types of nodes. One
partition consists of nodes with a large number of CPUs: 256 CPUs per node.
I'm trying to find out the current CPU allocation on some of those nodes,
but part of the information I've gathered seems to be inconsistent. If I run
"*scontrol show node <node-name>*", I get this for the CPU info:
    RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
    State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
    CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65
    CfgTRES=cpu=256,mem=450000M,billing=256
    AllocTRES=cpu=256,mem=256G
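That's just the CPU-related slice of the scontrol output; those fields can be
filtered out with something along these lines (the node name is a placeholder):

    # Keep only the lines carrying CPU, memory and TRES allocation fields
    scontrol show node <node-name> | grep -E 'CPU|Mem|TRES|ThreadsPerCore'
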
However, when I tried to identify the jobs to which the node's CPUs have
been allocated and to tally their allocated CPUs, I could only account for 128
CPUs effectively allocated on that node, based on the output of
*squeue --state=R -o "%C %N"*.
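Roughly, the tally was done like this (the node name is a placeholder, and it
ignores jobs spanning multiple nodes, since %C counts a job's CPUs across all
of its nodes):

    # Sum the CPU counts of running jobs that list this node
    squeue --state=R --nodelist=<node-name> --noheader -o "%C %N" | awk '{sum += $1} END {print sum}'
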
So I don't quite understand why the running jobs on the node account for only
128 of the 256 allocated CPUs, even though scontrol reports 100% CPU
allocation on the node. Could this be due to some misconfiguration, or a bug
in the SLURM version we're running (23.02.4)? The interesting thing is that we
have six nodes with similar specs, and all of them show up as allocated in the
output of *sinfo*, yet the running jobs on each node account for only 128
CPUs, as if they're all capped at 128.
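For reference, the per-node CPU counts by state (allocated/idle/other/total)
can be pulled from sinfo with something like this, where the node list is a
placeholder:

    # %n = hostname, %T = node state, %C = CPUs as allocated/idle/other/total
    sinfo -N -n <node-list> -o "%n %T %C"
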
Any thoughts, suggestions or assistance to figure this out would be greatly
appreciated.
Thanks,
Muhammad