[slurm-users] I can't seem to use all the CPUs in my Cluster?
Gary Mansell
gary.mansell at gmail.com
Tue Dec 13 09:38:44 UTC 2022
Dear Slurm Users, perhaps you can help me with a problem that I am having with the scheduler (I am new to this, so please forgive any stupid mistakes/misunderstandings).
I am not able to submit a multi-threaded MPI job that uses all 10 CPUs on a small demo cluster I have set up with Azure CycleCloud, and I don't understand why. Perhaps you can explain what is going wrong and how I can fix it so that all available CPUs are used?
The hpc partition that I have set up consists of 5 nodes (Azure VM type = Standard_F2s_v2), each with 2 CPUs. I presume these are two hyperthreads of a single physical core, rather than two separate physical CPUs, but I am not certain of this. Here is /proc/cpuinfo from one of the nodes:
[azccadmin@ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
stepping : 6
microcode : 0xffffffff
cpu MHz : 2793.436
cache size : 49152 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 21
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
bogomips : 5586.87
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
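For a more compact view of the topology, this is the sort of check I would run on one of the compute nodes (assuming lscpu is available on the CycleCloud image; I have not pasted its output here):

# Summarise sockets/cores/threads as the kernel reports them (a sanity
# check I would run, not output captured from these nodes):
lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

If that reports 1 core per socket and 2 threads per core, it would confirm my hyperthreading assumption.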
This is how Slurm sees one of the nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
AvailableFeatures=cloud
ActiveFeatures=cloud
Gres=(null)
NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1
Version=22.05.3
OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
Partitions=hpc
BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
LastBusyTime=2022-12-12T17:52:29
CfgTRES=cpu=1,mem=3G,billing=1
AllocTRES=
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
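I have not pasted the slurm.conf that CycleCloud generated, but based on the scontrol output above I would guess the node definition looks something like this (a guess on my part, not my actual configuration):

# Hypothetical node definition reconstructed from the scontrol output above;
# CPUs=1 would explain why Slurm only offers one CPU per node.
NodeName=ricslurm-hpc-pg0-[1-5] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3072 State=CLOUD Feature=cloud

If that is roughly right, then CfgTRES=cpu=1 is just Slurm faithfully reporting a one-CPU node definition, and the fix would presumably be in the node definition rather than in my job script, but that is exactly what I am unsure about.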
This is the Slurm job control script I have come up with to run the VECTIS job. I have set 5 nodes, 1 task per node, and 2 CPUs per task, so 5 x 1 x 2 = 10 CPUs in total (is this right?):
#!/bin/bash
## Job name
#SBATCH --job-name=run-grma
#
## File to write standard output and error
#SBATCH --output=run-grma.out
#SBATCH --error=run-grma.err
#
## Partition for the cluster (you might not need that)
#SBATCH --partition=hpc
#
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=2
#
## General
module purge
## Initialise VECTIS 2022.3b4
if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
then
    export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
else
    echo "Failed to Initialise VECTIS"
fi
## Run
vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp
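As an aside, this is the alternative way I considered expressing the same 10-CPU request, i.e. two single-CPU MPI tasks per node instead of one two-CPU task per node (only a sketch; I have not confirmed whether it behaves any differently):

## Hypothetical alternative request: 5 nodes x 2 tasks/node x 1 CPU/task = 10 CPUs
#SBATCH --nodes=5
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1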
But the job submitted with the script above will not run, as Slurm says there are not enough CPUs.
Here is the debug log from slurmctld, where you can see it saying that the job has requested 10 CPUs (which is what I want), but that the hpc partition only has 5 (which I think is wrong):
[2022-12-13T09:05:01.177] debug2: Processing RPC: REQUEST_NODE_INFO from
UID=0
[2022-12-13T09:05:01.370] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB
from UID=20001
[2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth hostname for
alloc_node: ricslurm-scheduler
[2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001 JobId=N/A
partition=hpc name=run-grma
[2022-12-13T09:05:01.371] debug3: cpus=10-4294967294 pn_min_cpus=2
core_spec=-1
[2022-12-13T09:05:01.371] debug3: Nodes=5-[5] Sock/Node=65534
Core/Sock=65534 Thread/Core=65534
[2022-12-13T09:05:01.371] debug3: pn_min_memory_job=18446744073709551615
pn_min_tmp_disk=-1
[2022-12-13T09:05:01.371] debug3: immediate=0 reservation=(null)
[2022-12-13T09:05:01.371] debug3: features=(null) batch_features=(null)
cluster_features=(null) prefer=(null)
[2022-12-13T09:05:01.371] debug3: req_nodes=(null) exc_nodes=(null)
[2022-12-13T09:05:01.371] debug3: time_limit=15-15 priority=-1
contiguous=0 shared=-1
[2022-12-13T09:05:01.371] debug3: kill_on_node_fail=-1 script=#!/bin/bash
## Job name
#SBATCH --job-n...
[2022-12-13T09:05:01.371] debug3:
argv="/shared/data/LID_CAVITY/slurm-runit.sh"
[2022-12-13T09:05:01.371] debug3:
environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
[2022-12-13T09:05:01.371] debug3: stdin=/dev/null
stdout=/shared/data/LID_CAVITY/run-grma.out
stderr=/shared/data/LID_CAVITY/run-grma.err
[2022-12-13T09:05:01.372] debug3: work_dir=/shared/data/LID_CAVITY
alloc_node:sid=ricslurm-scheduler:13464
[2022-12-13T09:05:01.372] debug3: power_flags=
[2022-12-13T09:05:01.372] debug3: resp_host=(null) alloc_resp_port=0
other_port=0
[2022-12-13T09:05:01.372] debug3: dependency=(null) account=(null)
qos=(null) comment=(null)
[2022-12-13T09:05:01.372] debug3: mail_type=0 mail_user=(null) nice=0
num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
[2022-12-13T09:05:01.372] debug3: network=(null) begin=Unknown
cpus_per_task=2 requeue=-1 licenses=(null)
[2022-12-13T09:05:01.372] debug3: end_time= signal=0 at 0 wait_all_nodes=-1
cpu_freq=
[2022-12-13T09:05:01.372] debug3: ntasks_per_node=1 ntasks_per_socket=-1
ntasks_per_core=-1 ntasks_per_tres=-1
[2022-12-13T09:05:01.372] debug3: mem_bind=0:(null) plane_size:65534
[2022-12-13T09:05:01.372] debug3: array_inx=(null)
[2022-12-13T09:05:01.372] debug3: burst_buffer=(null)
[2022-12-13T09:05:01.372] debug3: mcs_label=(null)
[2022-12-13T09:05:01.372] debug3: deadline=Unknown
[2022-12-13T09:05:01.372] debug3: bitflags=0x1a00c000
delay_boot=4294967294
[2022-12-13T09:05:01.372] debug3: job_submit/lua: slurm_lua_loadscript:
skipping loading Lua script: /etc/slurm/job_submit.lua
[2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
[2022-12-13T09:05:01.372] lua: returning.
[2022-12-13T09:05:01.372] debug2: _part_access_check: Job requested too
many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: _part_access_check: Job requested too
many CPUs (10) of partition hpc(5)
[2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition hpc: More
processors requested than permitted
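To sanity-check where the figure of 5 CPUs for the partition comes from, these are the queries I would run next (standard Slurm client commands; I have not included their output here):

# Show the partition limits and totals as the controller sees them
scontrol show partition hpc
# Show node count, CPUs per node and socket:core:thread layout for the partition
sinfo -p hpc -o "%N %D %c %z"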
The job will run fine if I use the settings below (across 5 nodes, but only using one of the two CPUs on each node):
## Number of nodes
#SBATCH --nodes=5
#
## Number of tasks per node
#SBATCH --ntasks-per-node=1
#
## Number of CPUs per task
#SBATCH --cpus-per-task=1
Here are the details of the successfully submitted job, showing it using 5 CPUs (one CPU per node) across the 5 nodes:
[azccadmin@ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
JobId=3 JobName=run-grma
UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
Priority=4294901757 Nice=0 Account=(null) QOS=(null)
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
AccrueTime=2022-12-12T17:32:01
StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-12T17:32:01
Scheduler=Main
Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
ReqNodeList=(null) ExcNodeList=(null)
NodeList=ricslurm-hpc-pg0-[1-5]
BatchHost=ricslurm-hpc-pg0-1
NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=5,mem=15G,node=5,billing=5
Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/shared/data/LID_CAVITY/slurm-runit.sh
WorkDir=/shared/data/LID_CAVITY
StdErr=/shared/data/LID_CAVITY/run-grma.err
StdIn=/dev/null
StdOut=/shared/data/LID_CAVITY/run-grma.out
Switches=1@00:00:24
Power=
What am I doing wrong here, and how do I get the job to run on both CPUs on all 5 nodes (i.e. fully utilising the available cluster resources of 10 CPUs)?
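In case it is relevant, these are the slurm.conf settings I suspect control whether hyperthreads are counted as schedulable CPUs. I am guessing at what CycleCloud generated, so please treat this as a sketch of what I intend to check rather than my actual configuration:

# Hypothetical select/partition settings (not confirmed from my cluster):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory    # would allocate whole cores
#SelectTypeParameters=CR_CPU_Memory    # would allocate individual threads/CPUs
PartitionName=hpc Nodes=ricslurm-hpc-pg0-[1-5] Default=YES MaxTime=INFINITE State=UP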
Regards
Gary