[slurm-users] Question about memory allocation
Sean Crosby
scrosby at unimelb.edu.au
Tue Dec 17 09:31:00 UTC 2019
Hi Mahmood,
Your running job (124) is requesting 6 CPUs per node (4 nodes, 6 CPUs per node). That means 6 CPUs are already in use on node hpc.
Your queued job (125) is requesting 5 CPUs per node (4 nodes, 5 CPUs per node). If it were running, node hpc would need 6 + 5 = 11 CPUs in total. But hpc only has 10 cores, so the job cannot start.
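You can verify this from the node and job records; a quick check (CPUAlloc/CPUTot and the per-node CPU_IDs line are standard scontrol output fields):

$ scontrol show node hpc | grep CPUTot
$ scontrol show -d job 124 | grep CPU_IDs

If hpc reports CPUAlloc=6 against CPUTot=10, there is no room for 5 more tasks there. One possible workaround (just a sketch on my side, not tested on your cluster) is to lower the queued job's --ntasks-per-node to 4, so the 6 + 4 = 10 CPUs needed on hpc fit within its 10 cores.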
Sean
On Tue, 17 Dec 2019 at 20:03, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
Please see the latest update:
# for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
RealMemory=64259 AllocMem=1024 FreeMem=57163 Sockets=32 Boards=1
RealMemory=120705 AllocMem=1024 FreeMem=97287 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=40045 Sockets=32 Boards=1
RealMemory=64259 AllocMem=1024 FreeMem=24154 Sockets=10 Boards=1
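For a per-node summary that covers CPUs as well as memory, something like this should also work (a sketch using standard sinfo format specifiers; %C prints allocated/idle/other/total CPUs and %e the free memory):

$ sinfo -N -p SEA -o '%N %C %m %e'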
$ sbatch slurm_qe.sh
Submitted batch job 125
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
125 SEA qe-fb mahmood PD 0:00 4 (Resources)
124 SEA U1phi1 abspou R 3:52 4 compute-0-[0-2],hpc
$ scontrol show -d job 125
JobId=125 JobName=qe-fb
UserId=mahmood(1000) GroupId=mahmood(1000) MCS_label=N/A
Priority=1751 Nice=0 Account=fish QOS=normal WCKey=*default
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
SubmitTime=2019-12-17T12:29:08 EligibleTime=2019-12-17T12:29:08
AccrueTime=2019-12-17T12:29:08
StartTime=Unknown EndTime=Unknown Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:29:09
Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:22742
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=4-4 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=20,mem=40G,node=4,billing=20
Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
MinCPUsNode=5 MinMemoryNode=10G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/mahmood/qe/f_borophene/slurm_qe.sh
WorkDir=/home/mahmood/qe/f_borophene
StdErr=/home/mahmood/qe/f_borophene/my_fb.log
StdIn=/dev/null
StdOut=/home/mahmood/qe/f_borophene/my_fb.log
Power=
$ cat slurm_qe.sh
#!/bin/bash
#SBATCH --job-name=qe-fb
#SBATCH --output=my_fb.log
#SBATCH --partition=SEA
#SBATCH --account=fish
#SBATCH --mem=10GB
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5
mpirun -np $SLURM_NTASKS /share/apps/q-e-qe-6.5/bin/pw.x -in f_borophene_scf.in
You can also see the job details for job 124:
$ scontrol show -d job 124
JobId=124 JobName=U1phi1
UserId=abspou(1002) GroupId=abspou(1002) MCS_label=N/A
Priority=958 Nice=0 Account=fish QOS=normal WCKey=*default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:06:17 TimeLimit=30-00:00:00 TimeMin=N/A
SubmitTime=2019-12-17T12:25:17 EligibleTime=2019-12-17T12:25:17
AccrueTime=2019-12-17T12:25:17
StartTime=2019-12-17T12:25:17 EndTime=2020-01-16T12:25:17 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:25:17
Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:20085
ReqNodeList=(null) ExcNodeList=(null)
NodeList=compute-0-[0-2],hpc
BatchHost=compute-0-0
NumNodes=4 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,mem=4G,node=4,billing=24
Socks/Node=* NtasksPerN:B:S:C=6:0:*:* CoreSpec=*
Nodes=compute-0-[0-2],hpc CPU_IDs=0-5 Mem=1024 GRES=
MinCPUsNode=6 MinMemoryNode=1G MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/slurm_script.sh
WorkDir=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1
StdErr=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
StdIn=/dev/null
StdOut=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
Power=
I cannot figure out the root of the problem.
Regards,
Mahmood
On Tue, Dec 17, 2019 at 11:18 AM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
Dear Mahmood,
could you please show the output of
scontrol show -d job 119
Best
Marcus