Hi Everyone,
We have a SLURM cluster with three different types of nodes. One partition consists of nodes with a large number of CPUs, 256 on each node.
I'm trying to find out the current CPU allocation on some of those nodes, but part of the information I've gathered seems to be incorrect. If I run *scontrol show node <node-name>*, I get this for the CPU info:
RealMemory=450000 AllocMem=262144 FreeMem=235397 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
CPUAlloc=256 CPUEfctv=256 CPUTot=256 CPULoad=126.65 CfgTRES=cpu=256,mem=450000M,billing=256
AllocTRES=cpu=256,mem=256G
However, when I tried to identify the jobs to which the node's CPUs have been allocated and tally up their allocated CPUs, I can only account for 128 CPUs on that node, based on the output of *squeue --state=R -o "%C %N"*. So I don't quite understand why the running jobs on the node account for only 128, and not 256, allocated CPUs even though scontrol reports the node as 100% allocated. Could this be due to some misconfiguration, or a bug in the SLURM version we're running? We're running Version=23.02.4. The interesting thing is that we have six nodes with similar specs, and all of them show up as allocated in the output of *sinfo*, yet the running jobs on each of them account for only 128 allocated CPUs, as if they were all capped at 128.
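For reference, this is roughly how I'm tallying the per-node allocation (just a sketch; <node-name> is a placeholder for one of the 256-CPU nodes):

# sum the CPU counts (%C) of running jobs that include this node
squeue --state=R --nodelist=<node-name> --noheader -o "%C" | awk '{sum += $1} END {print sum}'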
Any thoughts, suggestions or assistance to figure this out would be greatly appreciated.
Thanks, Muhammad
You appear to have HT/SMT enabled, so I would guess Slurm is treating the node as 256 threads on 128 physical cores. In other words, what you see will depend on how jobs request resources (by thread or by core). You can force Slurm to ignore this distinction, if that's what you really want.
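If I remember right, that's controlled by the consumable-resource settings in slurm.conf: CR_Core allocates whole cores (a job asking for one CPU still consumes both threads of that core), while CR_CPU allocates individual threads. A rough, untested sketch, with placeholder node names and values taken from your scontrol output:

# slurm.conf (sketch): schedule each hardware thread as an independent CPU
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU          # or CR_CPU_Memory if memory should also be a consumable resource
# with CR_CPU, configure only the CPU count (omit Sockets/CoresPerSocket/ThreadsPerCore)
# so Slurm can hand out individual threads instead of whole cores
NodeName=bignode[01-06] CPUs=256 RealMemory=450000 State=UNKNOWN

Double-check the SelectTypeParameters and node definition sections of the slurm.conf man page before changing this, since it affects scheduling for every job that lands on those nodes.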
regards, mark hahn
Hi Mark,
Thanks for your response. Yes, we have HT enabled on those nodes. Does the SLURM documentation talk about how to force SLURM to ignore that distinction? If you could provide more information on that, or point me in the right direction, I would appreciate it.
Thanks, Muhammad