[slurm-users] Question about memory allocation

Marcus Wagner wagner at itc.rwth-aachen.de
Tue Dec 17 09:52:40 UTC 2019


Dear Mahmood,

I'm not aware of any nodes that have 32, or even 10, sockets. Are you 
sure you want to use the cluster like that?
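
For example, one quick sanity check (just a sketch, assuming you can run 
commands on the compute nodes and that slurm.conf lives in the usual 
/etc/slurm/ location) would be to compare what the hardware reports with 
what slurm.conf declares:

# what slurmd itself detects on the node
slurmd -C
# what the OS reports
lscpu | grep -E 'Socket|Core|Thread'
# what slurm.conf currently declares
grep -i 'NodeName' /etc/slurm/slurm.conf

If the Sockets/CoresPerSocket/ThreadsPerCore values in slurm.conf do not 
match the slurmd -C output, Slurm's idea of the node layout will be off, 
and its scheduling decisions can be surprising.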

Best
Marcus

On 12/17/19 10:03 AM, Mahmood Naderan wrote:
> Please see the latest update
>
> # for i in {0..2}; do scontrol show node compute-0-$i | grep RealMemory; done && scontrol show node hpc | grep RealMemory
>    RealMemory=64259 AllocMem=1024 FreeMem=57163 Sockets=32 Boards=1
>    RealMemory=120705 AllocMem=1024 FreeMem=97287 Sockets=32 Boards=1
>    RealMemory=64259 AllocMem=1024 FreeMem=40045 Sockets=32 Boards=1
>    RealMemory=64259 AllocMem=1024 FreeMem=24154 Sockets=10 Boards=1
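>
> (FreeMem here is the free memory the OS reports on each node, while
> AllocMem is what Slurm has currently handed out to jobs. For a compact
> per-node view of the same numbers, a minimal sinfo example would be:
>
> sinfo -N -l
> # or with an explicit format string:
> sinfo -N -o "%N %m %e %C"
>
> where %m is the configured memory, %e the free memory and %C the
> allocated/idle/other/total CPU counts.)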
>
>
>
> $ sbatch slurm_qe.sh
> Submitted batch job 125
> $ squeue
>             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
>               125       SEA    qe-fb  mahmood PD       0:00      4 (Resources)
>               124       SEA   U1phi1   abspou  R       3:52      4 compute-0-[0-2],hpc
> $ scontrol show -d job 125
> JobId=125 JobName=qe-fb
>    UserId=mahmood(1000) GroupId=mahmood(1000) MCS_label=N/A
>    Priority=1751 Nice=0 Account=fish QOS=normal WCKey=*default
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:00:00 TimeLimit=30-00:00:00 TimeMin=N/A
>    SubmitTime=2019-12-17T12:29:08 EligibleTime=2019-12-17T12:29:08
>    AccrueTime=2019-12-17T12:29:08
>    StartTime=Unknown EndTime=Unknown Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:29:09
>    Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:22742
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=(null)
>    NumNodes=4-4 NumCPUs=20 NumTasks=20 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=20,mem=40G,node=4,billing=20
>    Socks/Node=* NtasksPerN:B:S:C=5:0:*:* CoreSpec=*
>    MinCPUsNode=5 MinMemoryNode=10G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/mahmood/qe/f_borophene/slurm_qe.sh
>    WorkDir=/home/mahmood/qe/f_borophene
>    StdErr=/home/mahmood/qe/f_borophene/my_fb.log
>    StdIn=/dev/null
>    StdOut=/home/mahmood/qe/f_borophene/my_fb.log
>    Power=
>
> $ cat slurm_qe.sh
> #!/bin/bash
> #SBATCH --job-name=qe-fb
> #SBATCH --output=my_fb.log
> #SBATCH --partition=SEA
> #SBATCH --account=fish
> #SBATCH --mem=10GB
> #SBATCH --nodes=4
> #SBATCH --ntasks-per-node=5
> mpirun -np $SLURM_NTASKS /share/apps/q-e-qe-6.5/bin/pw.x -in f_borophene_scf.in
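>
> (For reference: --mem is a per-node limit, so --mem=10GB together with
> --nodes=4 asks for 10 GB on each of the four nodes, which is why the
> job record above shows MinMemoryNode=10G and TRES mem=40G. A possible
> alternative, assuming roughly 2 GB per task were enough, would be to
> request memory per CPU instead:
>
> #SBATCH --mem-per-cpu=2G
>
> Note that --mem and --mem-per-cpu are mutually exclusive, so this line
> would replace the --mem=10GB line above.)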
>
>
>
>
> You can also see the details of job 124:
>
>
> $ scontrol show -d job 124
> JobId=124 JobName=U1phi1
>    UserId=abspou(1002) GroupId=abspou(1002) MCS_label=N/A
>    Priority=958 Nice=0 Account=fish QOS=normal WCKey=*default
>    JobState=RUNNING Reason=None Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
>    DerivedExitCode=0:0
>    RunTime=00:06:17 TimeLimit=30-00:00:00 TimeMin=N/A
>    SubmitTime=2019-12-17T12:25:17 EligibleTime=2019-12-17T12:25:17
>    AccrueTime=2019-12-17T12:25:17
>    StartTime=2019-12-17T12:25:17 EndTime=2020-01-16T12:25:17 Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2019-12-17T12:25:17
>    Partition=SEA AllocNode:Sid=hpc.scu.ac.ir:20085
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=compute-0-[0-2],hpc
>    BatchHost=compute-0-0
>    NumNodes=4 NumCPUs=24 NumTasks=24 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    TRES=cpu=24,mem=4G,node=4,billing=24
>    Socks/Node=* NtasksPerN:B:S:C=6:0:*:* CoreSpec=*
>      Nodes=compute-0-[0-2],hpc CPU_IDs=0-5 Mem=1024 GRES=
>    MinCPUsNode=6 MinMemoryNode=1G MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/slurm_script.sh
>    WorkDir=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1
>    StdErr=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
>    StdIn=/dev/null
>    StdOut=/home/abspou/OpenFOAM/abbaspour-6/run/laminarSMOKEPhi1U1/alpha3.45U1phi1lamSmoke.log
>    Power=
>
>
> I cannot figure out what the root of the problem is.
>
>
>
> Regards,
> Mahmood
>
>
>
>
> On Tue, Dec 17, 2019 at 11:18 AM Marcus Wagner <wagner at itc.rwth-aachen.de> wrote:
>
>     Dear Mahmood,
>
>     could you please show the output of
>
>     scontrol show -d job 119
>
>     Best
>     Marcus
>

-- 
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wagner at itc.rwth-aachen.de
www.itc.rwth-aachen.de
