[slurm-users] I can't seem to use all the CPUs in my Cluster?
Gary Mansell
gary.mansell at gmail.com
Tue Dec 13 09:38:44 UTC 2022
Dear Slurm Users, perhaps you can help me with a problem that I am having with the scheduler (I am new to this, so please forgive any stupid mistakes/misunderstandings).
I am not able to submit a multi-threaded MPI job that uses all 10 CPUs on a small demo cluster I have set up with Azure CycleCloud, and I don't understand why. Perhaps you can explain what is going wrong and how I can fix it so that all available CPUs are used?
The hpc partition that I have set up consists of 5 nodes (Azure VM type = Standard_F2s_v2), each with 2 CPUs. I presume these are two hyperthreads of a single physical core, rather than two separate physical CPUs, but I am not certain of this. Here is /proc/cpuinfo from one of the nodes:
[azccadmin@ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
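For a more compact view of the topology, this is the sort of check I would run on one of the compute nodes (assuming lscpu is available on the CycleCloud image; I have not pasted its output here):

# Summarise sockets/cores/threads as the kernel reports them (a sanity
# check I would run, not output captured from these nodes):
lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

If that reports 1 core per socket and 2 threads per core, it would confirm my hyperthreading assumption.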
This is how Slurm sees one of the nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
AvailableFeatures=cloud
ActiveFeatures=cloud
Gres=(null)
NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1
Version=22.05.3
OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=hpc
BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
LastBusyTime=2022-12-12T17:52:29
CfgTRES=cpu=1,mem=3G,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
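I have not pasted the slurm.conf that CycleCloud generated, but based on the scontrol output above I would guess the node definition looks something like this (a guess on my part, not my actual configuration):

# Hypothetical node definition reconstructed from the scontrol output above;
# CPUs=1 would explain why Slurm only offers one CPU per node.
NodeName=ricslurm-hpc-pg0-[1-5] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3072 State=CLOUD Feature=cloud

If that is roughly right, then CfgTRES=cpu=1 is just Slurm faithfully reporting a one-CPU node definition, and the fix would presumably be in the node definition rather than in my job script, but that is exactly what I am unsure about.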
This is the Slurm job control script I have come up with to run the VECTIS job. I have set 5 nodes, 1 task per node, and 2 CPUs per task, so 5 x 1 x 2 = 10 CPUs in total (is this right?):
#!/bin/bash
## Job name
#SBATCH --job-name=run-grma
#
## File to write standard output and error
#SBATCH --output=run-grma.out
#SBATCH --error=run-grma.err
#
## Partition for the cluster (you might not need that)
#SBATCH --partition=hpc
#
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=2
#
## General
module purge
## Initialise VECTIS 2022.3b4
if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
then
    export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
else
    echo "Failed to Initialise VECTIS"
fi
## Run
vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp
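As an aside, this is the alternative way I considered expressing the same 10-CPU request, i.e. two single-CPU MPI tasks per node instead of one two-CPU task per node (only a sketch; I have not confirmed whether it behaves any differently):

## Hypothetical alternative request: 5 nodes x 2 tasks/node x 1 CPU/task = 10 CPUs
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1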
But the job submitted with the script above will not run, as Slurm says there are not enough CPUs.
Here is the debug log from slurmctld, where you can see it saying that the job has requested 10 CPUs (which is what I want), but that the hpc partition only has 5 (which I think is wrong):
[2022-12-13T09:05:01.177] debug2: Processing RPC: REQUEST_NODE_INFO from
UID=0
[2022-12-13T09:05:01.370] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB
from UID=20001
[2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth hostname for
alloc_node: ricslurm-scheduler
[2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001 JobId=N/A
partition=hpc name=run-grma
[2022-12-13T09:05:01.371] debug3: cpus=10-4294967294 pn_min_cpus=2
core_spec=-1
[2022-12-13T09:05:01.371] debug3: Nodes=5-[5] Sock/Node=65534
Core/Sock=65534 Thread/Core=65534
[2022-12-13T09:05:01.371] debug3: pn_min_memory_job=18446744073709551615
pn_min_tmp_disk=-1
[2022-12-13T09:05:01.371] debug3: immediate=0 reservation=(null)
[2022-12-13T09:05:01.371] debug3: features=(null) batch_features=(null)
cluster_features=(null) prefer=(null)
[2022-12-13T09:05:01.371] debug3: req_nodes=(null) exc_nodes=(null)
[2022-12-13T09:05:01.371] debug3: time_limit=15-15 priority=-1
contiguous=0 shared=-1
[2022-12-13T09:05:01.371] debug3: kill_on_node_fail=-1 script=#!/bin/bash
## Job name
#SBATCH --job-n...
[2022-12-13T09:05:01.371] debug3:
argv="/shared/data/LID_CAVITY/slurm-runit.sh"
[2022-12-13T09:05:01.371] debug3:
environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
[2022-12-13T09:05:01.371] debug3: stdin=/dev/null
stdout=/shared/data/LID_CAVITY/run-grma.out
stderr=/shared/data/LID_CAVITY/run-grma.err
[2022-12-13T09:05:01.372] debug3: work_dir=/shared/data/LID_CAVITY
alloc_node:sid=ricslurm-scheduler:13464
[2022-12-13T09:05:01.372] debug3: power_flags=
[2022-12-13T09:05:01.372] debug3: resp_host=(null) alloc_resp_port=0
other_port=0
[2022-12-13T09:05:01.372] debug3: dependency=(null) account=(null)
qos=(null) comment=(null)
[2022-12-13T09:05:01.372] debug3: mail_type=0 mail_user=(null) nice=0
num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
[2022-12-13T09:05:01.372] debug3: network=(null) begin=Unknown
cpus_per_task=2 requeue=-1 licenses=(null)
[2022-12-13T09:05:01.372] debug3: end_time= signal=0 at 0 wait_all_nodes=-1
cpu_freq=
[2022-12-13T09:05:01.372] debug3: ntasks_per_node=1 ntasks_per_socket=-1
ntasks_per_core=-1 ntasks_per_tres=-1
[2022-12-13T09:05:01.372] debug3: mem_bind=0:(null) plane_size:65534
[2022-12-13T09:05:01.372] debug3: array_inx=(null)
[2022-12-13T09:05:01.372] debug3: burst_buffer=(null)
[2022-12-13T09:05:01.372] debug3: mcs_label=(null)
[2022-12-13T09:05:01.372] debug3: deadline=Unknown
[2022-12-13T09:05:01.372] debug3: bitflags=0x1a00c000
delay_boot=4294967294
[2022-12-13T09:05:01.372] debug3: job_submit/lua: slurm_lua_loadscript:
skipping loading Lua script: /etc/slurm/job_submit.lua
[2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
[2022-12-13T09:05:01.372] lua: returning.
[2022-12-13T09:05:01.372] debug2: _part_access_check: Job requested too
many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: _part_access_check: Job requested too
many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition hpc: More
processors requested than permitted
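To sanity-check where the figure of 5 CPUs for the partition comes from, these are the queries I would run next (standard Slurm client commands; I have not included their output here):

# Show the partition limits and totals as the controller sees them
scontrol show partition hpc
# Show node count, CPUs per node and socket:core:thread layout for the partition
sinfo -p hpc -o "%N %D %c %z"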
The job will run fine if I use the settings below (across 5 nodes, but only using one of the two CPUs on each node):
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=1
Here are the details of the successfully submitted job, showing it using 5 CPUs (one CPU per node) across the 5 nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
JobId=3 JobName=run-grma
UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
Priority=4294901757 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
AccrueTime=2022-12-12T17:32:01
StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-12T17:32:01
Scheduler=Main
Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ricslurm-hpc-pg0-[1-5]
BatchHost=ricslurm-hpc-pg0-1
NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=5,mem=15G,node=5,billing=5
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/shared/data/LID_CAVITY/slurm-runit.sh
WorkDir=/shared/data/LID_CAVITY
StdErr=/shared/data/LID_CAVITY/run-grma.err
StdIn=/dev/null
StdOut=/shared/data/LID_CAVITY/run-grma.out
Switches=1@00:00:24
Power=
What am I doing wrong here, and how do I get the job to run on both CPUs on all 5 nodes (i.e. fully utilising the available cluster resources of 10 CPUs)?
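In case it is relevant, these are the slurm.conf settings I suspect control whether hyperthreads are counted as schedulable CPUs. I am guessing at what CycleCloud generated, so please treat this as a sketch of what I intend to check rather than my actual configuration:

# Hypothetical select/partition settings (not confirmed from my cluster):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory    # would allocate whole cores
#SelectTypeParameters=CR_CPU_Memory    # would allocate individual threads/CPUs
PartitionName=hpc Nodes=ricslurm-hpc-pg0-[1-5] Default=YES MaxTime=INFINITE State=UP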
Regards
Gary