[slurm-users] Question about memory allocation
Sean Crosby
scrosby at unimelb.edu.au
Tue Dec 17 12:06:41 UTC 2019
What services did you restart after changing the slurm.conf? Did you do an scontrol reconfigure?
Do you have any reservations? scontrol show res
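For reference, a typical sequence for propagating a slurm.conf change (assuming a systemd-based install and that the same file has been copied to every node first) would be:

# systemctl restart slurmctld        (on the controller)
# systemctl restart slurmd           (on each compute node)
# scontrol reconfigure

Changes to node definitions such as CPUs usually need the daemon restarts; scontrol reconfigure alone is often not enough for them.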
Sean
On Tue, 17 Dec. 2019, 10:35 pm Mahmood Naderan, <mahmood.nt at gmail.com> wrote:
>Your running job is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That means 6 CPUs are being used on node hpc.
>Your queued job is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). In total, if it was running, that would require 11 CPUs on node hpc. But hpc only has 10 cores, so it can't run.
Right... I changed that, but the job is still in the pending state.
I modified /etc/slurm/slurm.conf as below:
# grep hpc /etc/slurm/slurm.conf
NodeName=hpc NodeAddr=10.1.1.1 CPUs=11
# for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
RealMemory=64259 AllocMem=1024 FreeMem=57116 Sockets=32 Boards=1
RealMemory=120705 AllocMem=1024 FreeMem=66403 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=39966 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=49189 Sockets=11 Boards=1
# for i in {0..2}; do scontrol show node compute-0-$i | grep CPUTot; done && scontrol show node hpc | grep CPUTot
CPUAlloc=6 CPUTot=32 CPULoad=5.18
CPUAlloc=6 CPUTot=32 CPULoad=18.94
CPUAlloc=6 CPUTot=32 CPULoad=5.41
CPUAlloc=6 CPUTot=11 CPULoad=5.21
But the job is still pending:
$ scontrol show -d job 129
JobId=129 JobName=qe-fb
UserId=mahmood(1000) GroupId=mahmood(1000) MCS_label=N/A
Priority=1751 Nice=0 Account=fish QOS=normal WCKey=*default
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
SubmitTime=2019-12-17T15:00:37 EligibleTime=2019-12-17T15:00:37
AccrueTime=2019-12-17T15:00:37
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T15:00:38
Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:14534
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=4-4 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=40G,node=4,billing=20
Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
MinCPUsNode=5 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mahmood/qe/f_borophene/slurm_qe.sh
WorkDir=/home/mahmood/qe/f_borophene
StdErr=/home/mahmood/qe/f_borophene/my_fb.log
StdIn=/dev/null
StdOut=/home/mahmood/qe/f_borophene/my_fb.log
Power=
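For reference, a batch header consistent with the request shown above (reconstructed from the scontrol output; the actual directives in slurm_qe.sh are not shown here, so these lines are an assumption) would look roughly like:

#SBATCH --partition=SEA
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5
#SBATCH --mem=10G

With the running job already holding 6 CPUs on hpc, the 5 extra tasks would bring that node to 11 CPUs, exactly the CPUTot=11 declared above.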
>I'm not aware of any nodes that have 32, or even 10, sockets. Are you sure you want to use the cluster like that?
Marcus,
I have installed slurm via slurm roll on Rocks. All 4 nodes are dual-socket Opteron 6282 with the following specs:
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
I just set CPUs=11 for the head node so that jobs do not fully utilize it.
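If the intent is to keep part of the head node free for system work while still describing its real layout, one possible sketch (the CoreSpecCount value here is an arbitrary example, not taken from this cluster) is:

NodeName=hpc NodeAddr=10.1.1.1 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 CoreSpecCount=6 RealMemory=64259

This declares the true 2 x 8 x 2 topology and reserves 6 cores for the OS, instead of under-declaring the CPU count.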
For example, compute-0-0 is
$ scontrol show node compute-0-0
NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=1
CPUAlloc=6 CPUTot=32 CPULoad=5.15
AvailableFeatures=rack-0,32CPUs
ActiveFeatures=rack-0,32CPUs
Gres=(null)
NodeAddr=10.1.1.254 NodeHostName=compute-0-0
OS=Linux 3.10.0-1062.1.2.el7.x86_64 #1 SMP Mon Sep 30 14:19:46 UTC 2019
RealMemory=64259 AllocMem=1024 FreeMem=57050 Sockets=32 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=444124 Weight=20511900 Owner=N/A MCS_label=N/A
Partitions=CLUSTER,WHEEL,SEA
BootTime=2019-10-10T19:01:38 SlurmdStartTime=2019-12-17T13:50:37
CfgTRES=cpu=32,mem=64259M,billing=47
AllocTRES=cpu=6,mem=1G
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
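Since this output reports Sockets=32 CoresPerSocket=1 ThreadsPerCore=1 while the hardware is 2 sockets x 8 cores x 2 threads, one way to get a node line matching what slurmd actually detects is to run slurmd -C on the node itself, e.g.

$ slurmd -C
NodeName=compute-0-0 CPUs=32 Boards=1 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64259

(the second line is only an illustration of the output format, not captured from this machine) and paste the result into slurm.conf.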
Regards,
Mahmood