[slurm-users] Slurm 1 CPU

Alex Chekholko alex at calicolabs.com
Thu Apr 4 23:35:42 UTC 2019


Hi Chris,

re: "can't run more than 1 job per node at a time.  "

try "scontrol show config" and grep for defmem

IIRC by default the memory request for any job is all the memory in a node.
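
A minimal sketch of what to check (DefMemPerNode / DefMemPerCPU are the actual
parameter names; the DefMemPerCPU value below is only a placeholder, not a
recommendation for your hardware):

    # on the head node: see what memory defaults the controller is using
    scontrol show config | grep -i DefMem

    # slurm.conf (example only): give jobs a per-CPU memory default so a
    # single job doesn't implicitly claim all of a node's memory
    DefMemPerCPU=4000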

Regards,
Alex

On Thu, Apr 4, 2019 at 4:01 PM Andy Riebs <andy.riebs at hpe.com> wrote:

> in slurm.conf, on the line(s) starting "NodeName=", you'll want to add
> specs for sockets, cores, and threads/core.
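>
> For example (a sketch only, using the socket/core counts from the lscpu
> output quoted below: 2 sockets x 24 cores, 1 thread per core; adjust
> RealMemory and the node range to match your hardware):
>
>     NodeName=cnode[001-020] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192080
>
> Running "slurmd -C" on a compute node prints a NodeName line with the
> values it detects, which you can paste into slurm.conf; restart slurmctld
> and the slurmds afterwards so the new layout takes effect.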
>
> ------------------------------
> *From:* Chris Bateson <cbateson at vt.edu>
> *Sent:* Thursday, April 04, 2019 5:18 PM
> *To:* Slurm-users <slurm-users at lists.schedmd.com>
> *Cc:*
> *Subject:* [slurm-users] Slurm 1 CPU
> I should start out by saying that I am extremely new to anything HPC.  Our
> end users purchased a 20-node cluster, which a vendor set up for us with
> Bright/Slurm.
>
> After our vendor said everything was complete and we started migrating our
> users' workflows to the new cluster, they discovered that they can't run
> more than 1 job per node at a time.  We started researching consumable
> resources, which I believe we've now enabled, but we're getting the same
> result.
>
> I've just discovered today that both *scontrol show node* and *sinfo -lNe*
> show that each of our nodes has 1 CPU.  I'm guessing that's why we can't
> run more than 1 job per node at a time.  I'm trying to determine where
> Slurm is getting this information and how I can get it to report the
> correct CPU count.
>
> Sample info:
>
> *scontrol show node*
>
> NodeName=cnode001 Arch=x86_64 CoresPerSocket=1
>    CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.01
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=(null)
>    NodeAddr=cnode001 NodeHostName=cnode001 Version=17.11
>    OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017
>    RealMemory=192080 AllocMem=0 FreeMem=188798 Sockets=1 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=2038 Weight=1 Owner=N/A
> MCS_label=N/A
>    Partitions=defq
>    BootTime=2019-03-26T14:28:24 SlurmdStartTime=2019-03-26T14:29:55
>    CfgTRES=cpu=1,mem=192080M,billing=1
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> *sinfo -lNe*
>
> NODELIST   NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
> cnode001       1     defq*  idle    1 1:1:1 192080     2038      1   (null) none
>
>
> *lscpu*
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                48
> On-line CPU(s) list:   0-47
> Thread(s) per core:    1
> Core(s) per socket:    24
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
> Stepping:              4
> CPU MHz:               2700.000
> BogoMIPS:              5400.00
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              33792K
> NUMA node0 CPU(s):
>  0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
> NUMA node1 CPU(s):
>  1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca
> sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
> avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc
> cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
>
>
> *slurm.conf SelectType Configuration*
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL
> AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=YES OverTimeLimit=0
> State=UP Nodes=cnode[001-020]
>
>
>
> I can provide other configs if you feel that it could help.
>
> Any ideas?  I would have thought that Slurm would grab the CPU information
> from the hardware itself instead of from the configuration file.
>
> Thanks
> Chris
>
>
>
>