[slurm-users] Question about memory allocation

Sean Crosby scrosby at unimelb.edu.au
Tue Dec 17 09:31:00 UTC 2019


Hi Mahmood,

Your running job (124) requested 4 nodes with 6 CPUs per node, so 6 CPUs are currently allocated on node hpc.
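You can confirm this from the node record itself; scontrol reports allocated versus total cores per node (the CPUAlloc and CPUTot fields):

$ scontrol show node hpc | grep CPU

While job 124 runs, hpc should report CPUAlloc=6, and CPUTot should match its 10 cores.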

Your queued job (125) requests 4 nodes with 5 CPUs per node. If it started now, node hpc would need 6 + 5 = 11 CPUs allocated, but hpc only has 10 cores, so the job stays pending with Reason=Resources. Despite the subject line, memory is not the constraint here: hpc has ~63 GB with only 1 GB allocated, and your job asks for 10 GB per node. The limit is cores.
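If the job does not strictly need 5 tasks on every node, one possible workaround (just a sketch, assuming an uneven task distribution is acceptable for your MPI run) is to drop --ntasks-per-node and request a total task count instead, letting the scheduler place fewer tasks on the busy hpc node and more on the 32-core compute nodes:

#SBATCH --nodes=4
#SBATCH --ntasks=20

Otherwise, the job will simply start once job 124 frees its cores on hpc.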

Sean


On Tue, 17 Dec 2019 at 20:03, Mahmood Naderan <mahmood.nt at gmail.com> wrote:
Please see the latest update:

# for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
   RealMemory=64259 AllocMem=1024 FreeMem=57163 Sockets=32 Boards=1
   RealMemory=120705 AllocMem=1024 FreeMem=97287 Sockets=32 Boards=1
   RealMemory=64259 AllocMem=1024 FreeMem=40045 Sockets=32 Boards=1
   RealMemory=64259 AllocMem=1024 FreeMem=24154 Sockets=10 Boards=1



$ sbatch slurm_qe.sh
Submitted batch job 125
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               125       SEA    qe-fb  mahmood PD       0:00      4 (Resources)
                124       SEA   U1phi1   abspou  R       3:52      4 compute-0-[0-2],hpc
$ scontrol show -d job 125
JobId=125 JobName=qe-fb
   UserId=mahmood(1000) GroupId=mahmood(1000) MCS_label=N/A
   Priority=1751 Nice=0 Account=fish QOS=normal WCKey=*default
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2019-12-17T12:29:08 EligibleTime=2019-12-17T12:29:08
   AccrueTime=2019-12-17T12:29:08
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:29:09
   Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:22742
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=4-4 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=20,mem=40G,node=4,billing=20
   Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
   MinCPUsNode=5 MinMemoryNode=10G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/mahmood/qe/f_borophene/slurm_qe.sh
   WorkDir=/home/mahmood/qe/f_borophene
   StdErr=/home/mahmood/qe/f_borophene/my_fb.log
   StdIn=/dev/null
   StdOut=/home/mahmood/qe/f_borophene/my_fb.log
   Power=

$ cat slurm_qe.sh
#!/bin/bash
#SBATCH --job-name=qe-fb
#SBATCH --output=my_fb.log
#SBATCH --partition=SEA
#SBATCH --account=fish
#SBATCH --mem=10GB
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=5
mpirun -np $SLURM_NTASKS /share/apps/q-e-qe-6.5/bin/pw.x -in f_borophene_scf.in




You can also see the details of job 124:


$ scontrol show -d job 124
JobId=124 JobName=U1phi1
   UserId=abspou(1002) GroupId=abspou(1002) MCS_label=N/A
   Priority=958 Nice=0 Account=fish QOS=normal WCKey=*default
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:06:17 TimeLimit=30-00:00:00 TimeMin=N/A
   SubmitTime=2019-12-17T12:25:17 EligibleTime=2019-12-17T12:25:17
   AccrueTime=2019-12-17T12:25:17
   StartTime=2019-12-17T12:25:17 EndTime=2020-01-16T12:25:17 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:25:17
   Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:20085
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-0-[0-2],hpc
   BatchHost=compute-0-0
   NumNodes=4 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=24,mem=4G,node=4,billing=24
   Socks/Node=* NtasksPerN:B:S:C=6:0:*:* CoreSpec=*
     Nodes=compute-0-[0-2],hpc CPU_IDs=0-5 Mem=1024 GRES=
   MinCPUsNode=6 MinMemoryNode=1G MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/slurm_script.sh
   WorkDir=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1
   StdErr=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
   StdIn=/dev/null
   StdOut=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
   Power=


I cannot figure out the root of the problem.



Regards,
Mahmood




On Tue, Dec 17, 2019 at 11:18 AM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
Dear Mahmood,

could you please show the output of

scontrol show -d job 119

Best
Marcus