[slurm-users] Slurm 1 CPU
Alex Chekholko
alex at calicolabs.com
Thu Apr 4 23:35:42 UTC 2019
Hi Chris,
re: "can't run more than 1 job per node at a time. "
try "scontrol show config" and grep for defmem
IIRC by default the memory request for any job is all the memory in a node.
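A quick way to check (illustrative only; the exact value on your cluster may differ):

  scontrol show config | grep -i defmem
  DefMemPerNode           = UNLIMITED

If DefMemPerNode is UNLIMITED, each job's default memory request is the whole
node, so a second job won't fit even once consumable resources are enabled.
Setting a per-CPU default in slurm.conf, e.g. DefMemPerCPU=4000 (the value is
just an example), followed by "scontrol reconfigure", should let jobs share a node's memory.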
Regards,
Alex
On Thu, Apr 4, 2019 at 4:01 PM Andy Riebs <andy.riebs at hpe.com> wrote:
> In slurm.conf, on the line(s) starting with "NodeName=", you'll want to add
> specs for sockets, cores, and threads per core.
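> For example (a sketch only; the socket/core/thread counts come from the lscpu
> output further down, and RealMemory from the scontrol output):
>
>   NodeName=cnode[001-020] Sockets=2 CoresPerSocket=24 ThreadsPerCore=1 RealMemory=192080 State=UNKNOWN
>
> After a change like that, restart slurmctld and slurmd; CPUTot should then
> report 48 per node rather than 1.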
>
> ------------------------------
> *From:* Chris Bateson <cbateson at vt.edu>
> *Sent:* Thursday, April 04, 2019 5:18 PM
> *To:* Slurm-users <slurm-users at lists.schedmd.com>
> *Cc:*
> *Subject:* [slurm-users] Slurm 1 CPU
> I should start out by saying that I am extremely new to anything HPC. Our
> end users purchased a 20-node cluster, which a vendor set up for us with
> Bright/Slurm.
>
> After our vendor said everything was complete and we started migrating our
> users' workflows to the new cluster, they discovered that they can't run
> more than 1 job per node at a time. We started researching how to enable
> consumable resources, which I believe we've done; however, we're getting
> the same result.
>
> I've just discovered today that both *scontrol show node* and *sinfo -lNe*
> show that each of our nodes has 1 CPU. I'm guessing that's why we can't
> submit more than 1 job at a time. I'm trying to determine where it is
> getting this information and how I can get it to display the correct CPU
> information.
>
> Sample info:
>
> *scontrol show node*
>
> NodeName=cnode001 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=0 CPUErr=0 CPUTot=1 CPULoad=0.01
> AvailableFeatures=(null)
> ActiveFeatures=(null)
> Gres=(null)
> NodeAddr=cnode001 NodeHostName=cnode001 Version=17.11
> OS=Linux 3.10.0-693.el7.x86_64 #1 SMP Thu Jul 6 19:56:57 EDT 2017
> RealMemory=192080 AllocMem=0 FreeMem=188798 Sockets=1 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=2038 Weight=1 Owner=N/A
> MCS_label=N/A
> Partitions=defq
> BootTime=2019-03-26T14:28:24 SlurmdStartTime=2019-03-26T14:29:55
> CfgTRES=cpu=1,mem=192080M,billing=1
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> *sinfo -lNe*
>
> NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK
> WEIGHT AVAIL_FE REASON
> cnode001 1 defq* idle 1 1:1:1 192080 2038
> 1 (null) none
>
>
> *lscpu*
>
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 48
> On-line CPU(s) list: 0-47
> Thread(s) per core: 1
> Core(s) per socket: 24
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 85
> Model name: Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
> Stepping: 4
> CPU MHz: 2700.000
> BogoMIPS: 5400.00
> Virtualization: VT-x
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 1024K
> L3 cache: 33792K
> NUMA node0 CPU(s):
> 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
> NUMA node1 CPU(s):
> 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq
> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca
> sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c
> rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 intel_pt tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb
> avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc
> cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
>
>
> *slurm.conf SelectType Configuration*
>
> SelectType=select/cons_res
> SelectTypeParameters=CR_Core_Memory
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL
> PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO
> Shared=NO GraceTime=0 PreemptMode=OFF ReqResv=NO AllowAccounts=ALL
> AllowQos=ALL LLN=NO ExclusiveUser=NO OverSubscribe=YES OverTimeLimit=0
> State=UP Nodes=cnode[001-020]
>
>
>
> I can provide other configs if you feel it could help.
>
> Any ideas? I would have thought that Slurm would grab the CPU information
> from the hardware rather than from the configuration.
>
> Thanks
> Chris
>
>
>
>