[slurm-users] Question about memory allocation

Mahmood Naderan mahmood.nt at gmail.com
Tue Dec 17 11:35:05 UTC 2019


>Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per
>node). That means 6 CPUs are being used on node hpc.
>Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node).
>In total, if it was running, that would require 11 CPUs on node hpc. But
>hpc only has 10 cores, so it can't run.

Right... I changed that, but the job is still in the pending state.
I modified /etc/slurm/slurm.conf as below:

# grep hpc /etc/slurm/slurm.conf
NodeName=hpc NodeAddr=10.1.1.1 CPUs=11
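
(My understanding is that a node definition change only takes effect once
slurm.conf has been copied to all nodes and the daemons have re-read it;
roughly the following, assuming the usual systemd units, which may be set up
differently by the slurm roll:)

# systemctl restart slurmctld      (on the head node, after syncing slurm.conf)
# systemctl restart slurmd         (on each compute node)
# scontrol reconfigure             (for parameters that can be re-read on the fly)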


# for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
   RealMemory=64259 AllocMem=1024 FreeMem=57116 Sockets=32 Boards=1
   RealMemory=120705 AllocMem=1024 FreeMem=66403 Sockets=32 Boards=1
   RealMemory=64259 AllocMem=1024 FreeMem=39966 Sockets=32 Boards=1
   RealMemory=64259 AllocMem=1024 FreeMem=49189 Sockets=11 Boards=1
# for i in {0..2}; do scontrol show node compute-0-$i | grep CPUTot; done && scontrol show node hpc | grep CPUTot
   CPUAlloc=6 CPUTot=32 CPULoad=5.18
   CPUAlloc=6 CPUTot=32 CPULoad=18.94
   CPUAlloc=6 CPUTot=32 CPULoad=5.41
   CPUAlloc=6 CPUTot=11 CPULoad=5.21


But the job is still pending:

$ scontrol show -d job 129
JobId=129 JobName=qe-fb
   UserId=mahmood(1000) GroupId=mahmood(1000) MCS_label=N/A
   Priority=1751 Nice=0 Account=fish QOS=normal WCKey=*default
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2019-12-17T15:00:37 EligibleTime=2019-12-17T15:00:37
   AccrueTime=2019-12-17T15:00:37
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T15:00:38
   Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:14534
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=4-4 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=40G,node=4,billing=20
   Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mahmood/qe/f_borophene/slurm_qe.sh
   WorkDir=/home/mahmood/qe/f_borophene
   StdErr=/home/mahmood/qe/f_borophene/my_fb.log
   StdIn=/dev/null
   StdOut=/home/mahmood/qe/f_borophene/my_fb.log
   Power=
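
In case it is relevant, these are the commands I would use to check how the
select plugin accounts for CPUs and memory, and what start time the scheduler
currently estimates for the job (commands only, I have not pasted their
output here):

$ scontrol show config | grep "^Select"
$ squeue -j 129 --start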


>I'm not aware of any nodes that have 32, or even 10, sockets. Are you
>sure you want to use the cluster like that?

Marcus,
I have installed Slurm via the slurm roll on Rocks. All 4 nodes are dual-socket
Opteron 6282 machines with the following specs:
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2

I deliberately set CPUs=11 for the head node so that jobs cannot fully
occupy it.
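
Just so I understand the NodeName syntax correctly: would a definition along
these lines be what you had in mind instead? (Only a sketch, using the values
from lscpu and scontrol above; CoreSpecCount is something I have read about
in the slurm.conf man page but have not tried.)

# compute node described with its real hardware topology
NodeName=compute-0-0 NodeAddr=10.1.1.254 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64259
# head node, reserving a few cores for system use instead of an arbitrary CPUs= value
NodeName=hpc NodeAddr=10.1.1.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64259 CoreSpecCount=3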

For example, compute-0-0 looks like this:

$ scontrol show node compute-0-0
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=6 CPUTot=32 CPULoad=5.15
   AvailableFeatures=rack-0,32CPUs
   ActiveFeatures=rack-0,32CPUs
   Gres=(null)
   NodeAddr=10.1.1.254 NodeHostName=compute-0-0
   OS=Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019
   RealMemory=64259 AllocMem=1024 FreeMem=57050 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A
MCS_label=N/A
   Partitions=CLUSTER,WHEEL,SEA
   BootTime=2019-10-10T19:01:38 SlurmdStartTime=2019-12-17T13:50:37
   CfgTRES=cpu=32,mem=64259M,billing=47
   AllocTRES=cpu=6,mem=1G
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
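
To double-check what is currently occupying hpc, I can also list the
allocations on that node, something like this (the format string is just an
example):

$ squeue -w hpc -o "%.8i %.10u %.2t %.10M %.5D %.4C %.8m %R"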


Regards,
Mahmood