[slurm-users] I can't seem to use all the CPUs in my Cluster?

Gary Mansell gary.mansell at gmail.com
Tue Dec 13 16:03:07 UTC 2022


Hi, thanks for getting back to me.

I have been doing some more experimenting, and I think that the issue is
because the Azure VMs for my nodes are HyperThreaded.

Slurm sees the cluster as 5 nodes with 1 CPU each and seems to ignore the
HyperThreading - hence it counts the cluster as having 5 CPUs (not 10 as I
thought), so it is correct that it can't run a 10-CPU job.
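
For what it's worth, my working assumption is that this is driven by how
CycleCloud writes the node definitions rather than by Slurm itself ignoring
HT. A sketch of what I assume the generated slurm.conf line looks like
(values taken from the scontrol output quoted below; I have not checked the
actual file):

NodeName=ricslurm-hpc-pg0-[1-5] CPUs=1 Sockets=1 CoresPerSocket=1 ThreadsPerCore=2 RealMemory=3072 Feature=cloud
# With CPUs=1, Slurm will only ever allocate one CPU per node, so the second
# hardware thread is invisible to the scheduler. Advertising CPUs=2 (or
# omitting CPUs so it defaults to Sockets x CoresPerSocket x ThreadsPerCore)
# would expose both threads.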

My CFD types tell me our code should not be run on HT nodes, so I have
switched the nodes to a different Azure VM SKU without HT, and the CPU
count in Slurm now matches the CPU count in the VMs.

So - does Slurm actually ignore HT cores, as I am supposing?

Regards
Gary


On Tue, 13 Dec 2022 at 15:52, Brian Andrus <toomuchit at gmail.com> wrote:

> Gary,
>
> Well, your first issue is using CycleCloud, but that is mostly opinion :)
>
> Your error states there aren't enough CPUs in the partition, which means
> we should take a look at the partition settings.
>
> Take a look at 'scontrol show partition hpc' and see how many nodes are
> assigned to it. Also check the state of the nodes with 'sinfo'.
>
> It would also be good to ensure the node settings are right. Run 'slurmd
> -C' on a node and see if the output matches what is in the config.
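>
> Something along these lines, if it helps (the sinfo format flags are just
> one way to print the per-node topology):
>
> # on the scheduler
> scontrol show partition hpc               # look at TotalCPUs / TotalNodes
> sinfo -N -p hpc -o "%N %T %c %X %Y %Z"    # node, state, CPUs, sockets, cores, threads
> # on a compute node, to compare against slurm.conf
> slurmd -C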
>
> Brian Andrus
> On 12/13/2022 1:38 AM, Gary Mansell wrote:
>
> Dear Slurm Users, perhaps you can help me with a problem that I am having
> using the Scheduler (I am new to this, so please forgive me for any stupid
> mistakes/misunderstandings).
>
> I am not able to submit a multi-threaded MPI job that uses all 10 CPUs on a
> small demo cluster I have set up using Azure CycleCloud, and I don't
> understand why – perhaps you can explain what is going wrong and how I can
> fix it so that all available CPUs are used?
>
>
>
> The hpc partition that I have set up consists of 5 nodes (Azure VM type =
> Standard_F2s_v2), each with 2 CPUs (I presume these are hyperthreaded cores
> rather than 2 physical CPUs – but I am not certain of this).
>
>
>
> [azccadmin at ricslurm-hpc-pg0-1 ~]$ cat /proc/cpuinfo
>
> processor       : 0
>
> vendor_id       : GenuineIntel
>
> cpu family      : 6
>
> model           : 106
>
> model name      : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
>
> stepping        : 6
>
> microcode       : 0xffffffff
>
> cpu MHz         : 2793.436
>
> cache size      : 49152 KB
>
> physical id     : 0
>
> siblings        : 2
>
> core id         : 0
>
> cpu cores       : 1
>
> apicid          : 0
>
> initial apicid  : 0
>
> fpu             : yes
>
> fpu_exception   : yes
>
> cpuid level     : 21
>
> wp              : yes
>
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
> constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
> cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
> lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
> bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
> clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
>
> bogomips        : 5586.87
>
> clflush size    : 64
>
> cache_alignment : 64
>
> address sizes   : 46 bits physical, 48 bits virtual
>
> power management:
>
>
>
> processor       : 1
>
> vendor_id       : GenuineIntel
>
> cpu family      : 6
>
> model           : 106
>
> model name      : Intel(R) Xeon(R) Platinum 8370C CPU @ 2.80GHz
>
> stepping        : 6
>
> microcode       : 0xffffffff
>
> cpu MHz         : 2793.436
>
> cache size      : 49152 KB
>
> physical id     : 0
>
> siblings        : 2
>
> core id         : 0
>
> cpu cores       : 1
>
> apicid          : 1
>
> initial apicid  : 1
>
> fpu             : yes
>
> fpu_exception   : yes
>
> cpuid level     : 21
>
> wp              : yes
>
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
> cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm
> constant_tsc rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma
> cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor
> lahf_lm abm 3dnowprefetch invpcid_single tpr_shadow vnmi ept vpid fsgsbase
> bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap
> clflushopt avx512cd avx512bw avx512vl xsaveopt xsavec md_clear
>
> bogomips        : 5586.87
>
> clflush size    : 64
>
> cache_alignment : 64
>
> address sizes   : 46 bits physical, 48 bits virtual
>
> power management:
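>
> (So each logical CPU reports siblings=2 and cpu cores=1, i.e. one physical
> core exposing two hyperthreads. If it is easier to read, something like
>
> lscpu | grep -E '^(CPU\(s\)|Thread|Core|Socket)'
>
> should summarise the same topology, though I have not pasted that output
> here.)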
>
>
>
> This is how Slurm sees one of the nodes:
>
>
>
> [azccadmin at ricslurm-scheduler LID_CAVITY]$ scontrol show nodes
>
> NodeName=ricslurm-hpc-pg0-1 Arch=x86_64 CoresPerSocket=1
>
>    CPUAlloc=0 CPUEfctv=1 CPUTot=1 CPULoad=0.88
>
>    AvailableFeatures=cloud
>
>    ActiveFeatures=cloud
>
>    Gres=(null)
>
>    NodeAddr=ricslurm-hpc-pg0-1 NodeHostName=ricslurm-hpc-pg0-1
> Version=22.05.3
>
>    OS=Linux 3.10.0-1127.19.1.el7.x86_64 #1 SMP Tue Aug 25 17:23:54 UTC 2020
>
>    RealMemory=3072 AllocMem=0 FreeMem=1854 Sockets=1 Boards=1
>
>    State=IDLE+CLOUD ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A
> MCS_label=N/A
>
>    Partitions=hpc
>
>    BootTime=2022-12-12T17:42:27 SlurmdStartTime=2022-12-12T17:42:28
>
>    LastBusyTime=2022-12-12T17:52:29
>
>    CfgTRES=cpu=1,mem=3G,billing=1
>
>    AllocTRES=
>
>    CapWatts=n/a
>
>    CurrentWatts=0 AveWatts=0
>
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
>
>
>
> This is the Slurm job script I have come up with to run the VECTIS job
> (I have set 5 nodes, 1 task per node, and 2 CPUs per task – is this right?):
>
>
>
> #!/bin/bash
>
> ## Job name
> #SBATCH --job-name=run-grma
> #
> ## File to write standard output and error
> #SBATCH --output=run-grma.out
> #SBATCH --error=run-grma.err
> #
> ## Partition for the cluster (you might not need that)
> #SBATCH --partition=hpc
> #
> ## Number of nodes
> #SBATCH --nodes=5
> #
> ## Number of tasks per node
> #SBATCH --ntasks-per-node=1
> #
> ## Number of CPUs per task
> #SBATCH --cpus-per-task=2
> #
>
> ## General
> module purge
>
> ## Initialise VECTIS 2022.3b4
> if [ -d /shared/apps/RealisSimulation/2022.3/bin ]
> then
>     export PATH=$PATH:/shared/apps/RealisSimulation/2022.3/bin
> else
>     echo "Failed to Initialise VECTIS"
> fi
>
> ## Run
> vpre -V 2022.3 -np $SLURM_NTASKS /shared/data/LID_CAVITY/files/lid.GRD
> vsolve -V 2022.3 -np $SLURM_NTASKS -mpi intel_2018.4 -rdmu /shared/data/LID_CAVITY/files/lid_no_write.inp
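>
> (As an aside, my understanding - which may be wrong - is that a request
> like this can be sanity-checked without running anything via
>
> sbatch --test-only slurm-runit.sh
>
> which should either report when/where the job could start or print the
> same "more processors requested than permitted" style of error.)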
>
>
>
>
>
> But the submitted job will not run, as Slurm says there are not enough CPUs.
>
>
>
> Here is the debug log from slurmctld, where you can see that the job has
> requested 10 CPUs (which is what I want) but the hpc partition is reported
> as only having 5 (which I think is wrong):
>
>
>
> [2022-12-13T09:05:01.177] debug2: Processing RPC: REQUEST_NODE_INFO from
> UID=0
>
> [2022-12-13T09:05:01.370] debug2: Processing RPC: REQUEST_SUBMIT_BATCH_JOB
> from UID=20001
>
> [2022-12-13T09:05:01.371] debug3: _set_hostname: Using auth hostname for
> alloc_node: ricslurm-scheduler
>
> [2022-12-13T09:05:01.371] debug3: JobDesc: user_id=20001 JobId=N/A
> partition=hpc name=run-grma
>
> [2022-12-13T09:05:01.371] debug3:    cpus=10-4294967294 pn_min_cpus=2
> core_spec=-1
>
> [2022-12-13T09:05:01.371] debug3:    Nodes=5-[5] Sock/Node=65534
> Core/Sock=65534 Thread/Core=65534
>
> [2022-12-13T09:05:01.371] debug3:
> pn_min_memory_job=18446744073709551615 pn_min_tmp_disk=-1
>
> [2022-12-13T09:05:01.371] debug3:    immediate=0 reservation=(null)
>
> [2022-12-13T09:05:01.371] debug3:    features=(null) batch_features=(null)
> cluster_features=(null) prefer=(null)
>
> [2022-12-13T09:05:01.371] debug3:    req_nodes=(null) exc_nodes=(null)
>
> [2022-12-13T09:05:01.371] debug3:    time_limit=15-15 priority=-1
> contiguous=0 shared=-1
>
> [2022-12-13T09:05:01.371] debug3:    kill_on_node_fail=-1
> script=#!/bin/bash
>
>
>
> ## Job name
>
> #SBATCH --job-n...
>
> [2022-12-13T09:05:01.371] debug3:
> argv="/shared/data/LID_CAVITY/slurm-runit.sh"
>
> [2022-12-13T09:05:01.371] debug3:
> environment=XDG_SESSION_ID=12,HOSTNAME=ricslurm-scheduler,SELINUX_ROLE_REQUESTED=,...
>
> [2022-12-13T09:05:01.371] debug3:    stdin=/dev/null
> stdout=/shared/data/LID_CAVITY/run-grma.out
> stderr=/shared/data/LID_CAVITY/run-grma.err
>
> [2022-12-13T09:05:01.372] debug3:    work_dir=/shared/data/LID_CAVITY
> alloc_node:sid=ricslurm-scheduler:13464
>
> [2022-12-13T09:05:01.372] debug3:    power_flags=
>
> [2022-12-13T09:05:01.372] debug3:    resp_host=(null) alloc_resp_port=0
> other_port=0
>
> [2022-12-13T09:05:01.372] debug3:    dependency=(null) account=(null)
> qos=(null) comment=(null)
>
> [2022-12-13T09:05:01.372] debug3:    mail_type=0 mail_user=(null) nice=0
> num_tasks=5 open_mode=0 overcommit=-1 acctg_freq=(null)
>
> [2022-12-13T09:05:01.372] debug3:    network=(null) begin=Unknown
> cpus_per_task=2 requeue=-1 licenses=(null)
>
> [2022-12-13T09:05:01.372] debug3:    end_time= signal=0 at 0
> wait_all_nodes=-1 cpu_freq=
>
> [2022-12-13T09:05:01.372] debug3:    ntasks_per_node=1
> ntasks_per_socket=-1 ntasks_per_core=-1 ntasks_per_tres=-1
>
> [2022-12-13T09:05:01.372] debug3:    mem_bind=0:(null) plane_size:65534
>
> [2022-12-13T09:05:01.372] debug3:    array_inx=(null)
>
> [2022-12-13T09:05:01.372] debug3:    burst_buffer=(null)
>
> [2022-12-13T09:05:01.372] debug3:    mcs_label=(null)
>
> [2022-12-13T09:05:01.372] debug3:    deadline=Unknown
>
> [2022-12-13T09:05:01.372] debug3:    bitflags=0x1a00c000
> delay_boot=4294967294
>
> [2022-12-13T09:05:01.372] debug3: job_submit/lua: slurm_lua_loadscript:
> skipping loading Lua script: /etc/slurm/job_submit.lua
>
> [2022-12-13T09:05:01.372] lua: Setting reqswitch to 1.
>
> [2022-12-13T09:05:01.372] lua: returning.
>
> [2022-12-13T09:05:01.372] debug2: _part_access_check: Job requested too
> many CPUs (10) of partition hpc(5)
>
> [2022-12-13T09:05:01.373] debug2: _part_access_check: Job requested too
> many CPUs (10) of partition hpc(5)
>
> [2022-12-13T09:05:01.373] debug2: JobId=1 can't run in partition hpc: More
> processors requested than permitted
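>
> (For what it's worth, the partition-side total that this check compares
> against can be seen with something like
>
> sinfo -p hpc -o "%P %D %C"                       # partition, node count, CPUs A/I/O/T
> scontrol show partition hpc | grep -i TotalCPUs
>
> which should show the 5 CPUs the log is complaining about.)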
>
>
>
>
>
> The job will run fine if I use the settings below (across 5 nodes, but
> only using one of the two CPUs on each node):
>
>
>
> ## Number of nodes
> #SBATCH --nodes=5
> #
> ## Number of tasks per node
> #SBATCH --ntasks-per-node=1
> #
> ## Number of CPUs per task
> #SBATCH --cpus-per-task=1
>
>
>
> Here are the details of the successfully submitted job, showing it using
> 5 CPUs (only one CPU per node) across 5 nodes:
>
>
>
> [azccadmin at ricslurm-scheduler LID_CAVITY]$ scontrol show job 3
>
> JobId=3 JobName=run-grma
>
>    UserId=azccadmin(20001) GroupId=azccadmin(20001) MCS_label=N/A
>
>    Priority=4294901757 Nice=0 Account=(null) QOS=(null)
>
>    JobState=RUNNING Reason=None Dependency=(null)
>
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>
>    RunTime=00:07:35 TimeLimit=00:15:00 TimeMin=N/A
>
>    SubmitTime=2022-12-12T17:32:01 EligibleTime=2022-12-12T17:32:01
>
>    AccrueTime=2022-12-12T17:32:01
>
>    StartTime=2022-12-12T17:42:46 EndTime=2022-12-12T17:57:46 Deadline=N/A
>
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2022-12-12T17:32:01
> Scheduler=Main
>
>    Partition=hpc AllocNode:Sid=ricslurm-scheduler:11723
>
>    ReqNodeList=(null) ExcNodeList=(null)
>
>    NodeList=ricslurm-hpc-pg0-[1-5]
>
>    BatchHost=ricslurm-hpc-pg0-1
>
>    NumNodes=5 NumCPUs=5 NumTasks=5 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>
>    TRES=cpu=5,mem=15G,node=5,billing=5
>
>    Socks/Node=* NtasksPerN:B:S:C=1:0:*:* CoreSpec=*
>
>    MinCPUsNode=1 MinMemoryCPU=3G MinTmpDiskNode=0
>
>    Features=(null) DelayBoot=00:00:00
>
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>
>    Command=/shared/data/LID_CAVITY/slurm-runit.sh
>
>    WorkDir=/shared/data/LID_CAVITY
>
>    StdErr=/shared/data/LID_CAVITY/run-grma.err
>
>    StdIn=/dev/null
>
>    StdOut=/shared/data/LID_CAVITY/run-grma.out
>
>    Switches=1 at 00:00:24
>
>    Power=
>
>
>
>
> What am I doing wrong here - how do I get the job to run on both CPUs on
> all 5 nodes (i.e. fully utilising the available cluster resources of 10
> CPUs)?
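>
> (To put numbers on it, with each node advertising 1 CPU the hpc partition
> tops out at 5 CPUs, so as far as I can tell the two requests work out as:
>
> # original request: 5 nodes x 1 task/node x 2 CPUs/task = 10 CPUs > 5 available -> rejected
> # reduced request:  5 nodes x 1 task/node x 1 CPU/task  =  5 CPUs = 5 available -> runs
>
> so presumably I either need each node to advertise 2 CPUs to Slurm, or I
> have to accept running on 5.)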
>
>
>
> Regards
>
>
>
> Gary
>
>