<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 2021/10/13 9:22, Brian Andrus wrote:<br>
    </div>
    <blockquote type="cite"
      cite="mid:78EA9C87-20EE-4392-A3B5-4EA2E468C664@hxcore.ol">
      <div class="WordSection1">
        <p class="MsoNormal">Something is very odd when you have the
          node reporting:</p>
        <p class="MsoNormal" style="text-indent:.5in">RealMemory=1
          AllocMem=0 FreeMem=47563 Sockets=2 Boards=1</p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">What do you get when you run ‘slurmd -C’ on
          the node?</p>
      </div>
    </blockquote>
    # slurmd -C<br>
    NodeName=apollo CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18
    ThreadsPerCore=1 RealMemory=128306<br>
    UpTime=22-16:14:48<br>
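    <p>For comparison, here is a minimal Python sketch that puts the
      RealMemory detected by slurmd -C next to the value shown by
      scontrol show node, using only the numbers quoted in this thread.
      The helper names parse_kv and memory_mismatch are made up for
      illustration and are not part of Slurm.</p>
<pre>
```python
import re


def parse_kv(text):
    """Collect KEY=VALUE tokens from Slurm-style output into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


# Values copied from the outputs quoted in this thread.
detected = parse_kv(
    "NodeName=apollo CPUs=36 Boards=1 SocketsPerBoard=2 CoresPerSocket=18 "
    "ThreadsPerCore=1 RealMemory=128306"
)
configured = parse_kv("RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1")


def memory_mismatch(detected, configured):
    """Return (configured_mb, detected_mb) if RealMemory differs, else None."""
    cfg = int(configured["RealMemory"])
    det = int(detected["RealMemory"])
    return (cfg, det) if cfg != det else None


print(memory_mismatch(detected, configured))
```
</pre>
    <p>With these values the node is configured with RealMemory=1 (MB)
      while slurmd detects 128306 MB.</p>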
    <blockquote type="cite"
      cite="mid:78EA9C87-20EE-4392-A3B5-4EA2E468C664@hxcore.ol">
      <div class="WordSection1">
        <p class="MsoNormal">Brian Andrus<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div
          style="mso-element:para-border-div;border:none;border-top:solid
          #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">
          <p class="MsoNormal" style="border:none;padding:0in"><b>From:
            </b><a href="mailto:adam_xu@adagene.com.cn"
              moz-do-not-send="true">Adam Xu</a><br>
            <b>Sent: </b>Tuesday, October 12, 2021 6:07 PM<br>
            <b>To: </b><a href="mailto:slurm-users@lists.schedmd.com"
              moz-do-not-send="true" class="moz-txt-link-freetext">slurm-users@lists.schedmd.com</a><br>
            <b>Subject: </b>Re: [slurm-users] job is pending but
            resources are available</p>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On 2021/10/12 21:21, Adam Xu wrote:<o:p></o:p></p>
        </div>
        <blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">
          <p class="MsoNormal">Hi All, <br>
            <br>
            OS: Rocky Linux 8.4 <br>
            <br>
            Slurm version: 20.11.7 <br>
            <br>
            The partition's name is apollo, and the node's name is
            apollo too. The node has 36 CPU cores and 8 GPUs. <br>
            <br>
            partition info <br>
            <br>
            $ scontrol show partition apollo <br>
            PartitionName=apollo <br>
               AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL <br>
               AllocNodes=ALL Default=NO QoS=N/A <br>
               DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
            GraceTime=0 Hidden=NO <br>
               MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
            MaxCPUsPerNode=UNLIMITED <br>
               Nodes=apollo <br>
               PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
            OverSubscribe=YES:36 <br>
               OverTimeLimit=NONE PreemptMode=OFF <br>
               State=UP TotalCPUs=36 TotalNodes=1
            SelectTypeParameters=NONE <br>
               JobDefaults=(null) <br>
               DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED <br>
            <br>
            node info <br>
            <br>
            $ scontrol show node apollo <br>
            NodeName=apollo Arch=x86_64 CoresPerSocket=18 <br>
               CPUAlloc=28 CPUTot=36 CPULoad=7.02 <br>
               AvailableFeatures=(null) <br>
               ActiveFeatures=(null) <br>
               Gres=gpu:v100:8,mps:v100:800 <br>
               NodeAddr=apollo NodeHostName=apollo Version=20.11.7 <br>
               OS=Linux 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Wed Sep 15
            19:12:32 UTC 2021 <br>
               RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1
            <br>
               State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
            MCS_label=N/A <br>
               Partitions=apollo <br>
               BootTime=2021-09-20T23:43:49
            SlurmdStartTime=2021-10-12T16:55:44 <br>
               CfgTRES=cpu=36,mem=1M,billing=36 <br>
               AllocTRES=cpu=28 <br>
               CapWatts=n/a <br>
               CurrentWatts=0 AveWatts=0 <br>
               ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
            <br>
               Comment=(null) <br>
            <br>
            I now have 7 jobs running, but when I submit an 8th job it
            stays pending with reason "Resources". <br>
            <br>
            $ squeue <br>
                         JOBID PARTITION     NAME     USER ST       TIME
            NODES NODELIST(REASON) <br>
                           879    apollo    do.sh zhining_ PD       0:00
            1 (Resources) <br>
                           489    apollo    do.sh zhining_  R
            13-12:50:45 1 apollo <br>
                           490    apollo    do.sh zhining_  R
            13-12:41:00 1 apollo <br>
                           592    apollo runme-gp junwen_f  R 4-12:42:31
            1 apollo <br>
                           751    apollo runme-gp junwen_f  R 1-12:48:20
            1 apollo <br>
                           752    apollo runme-gp junwen_f  R 1-12:48:10
            1 apollo <br>
                           871    apollo runme-gp junwen_f  R    7:13:45
            1 apollo <br>
                           872    apollo runme-gp junwen_f  R    7:12:42
            1 apollo <br>
            <br>
            $ scontrol show job 879 <br>
            JobId=879 JobName=do.sh <br>
               UserId=zhining_wan(1001) GroupId=zhining_wan(1001)
            MCS_label=N/A <br>
               Priority=4294900882 Nice=0 Account=(null) QOS=(null) <br>
               JobState=PENDING Reason=Resources Dependency=(null) <br>
               Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0 <br>
               RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A <br>
               SubmitTime=2021-10-12T16:29:29
            EligibleTime=2021-10-12T16:29:29 <br>
               AccrueTime=2021-10-12T16:29:29 <br>
               StartTime=2021-10-12T21:17:41 EndTime=Unknown
            Deadline=N/A <br>
               SuspendTime=None SecsPreSuspend=0
            LastSchedEval=2021-10-12T21:17:39 <br>
               Partition=apollo AllocNode:Sid=sms:1281191 <br>
               ReqNodeList=(null) ExcNodeList=(null) <br>
               NodeList=(null) SchedNodeList=apollo <br>
               NumNodes=1-1 NumCPUs=4 NumTasks=4 CPUs/Task=1
            ReqB:S:C:T=0:0:*:* <br>
               TRES=cpu=4,node=1,billing=4 <br>
               Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=* <br>
               MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0 <br>
               Features=(null) DelayBoot=00:00:00 <br>
               OverSubscribe=YES Contiguous=0 Licenses=(null)
            Network=(null) <br>
Command=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/do.sh
            <br>
WorkDir=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test
            <br>
StdErr=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/slurm-879.out
            <br>
               StdIn=/dev/null <br>
StdOut=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/slurm-879.out
            <br>
               Power= <br>
               TresPerNode=gpu:1 <br>
               NtasksPerTRES:0 <br>
            <br>
            With 7 jobs running, the node has 8 CPU cores and 1 GPU
            left, so I am sure the remaining resources are
            sufficient. Why is the job still pending with reason
            "Resources"? <o:p></o:p></p>
        </blockquote>
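        <p>The CPU arithmetic above can be sketched in Python from the
          quoted scontrol fields. This is only an illustration with
          hypothetical helper names (parse_kv, free_cpus), not a Slurm
          tool, and it checks the CPU count only; it says nothing about
          memory or GRES, which the scheduler may also take into
          account.</p>
<pre>
```python
import re


def parse_kv(text):
    """Collect KEY=VALUE tokens from scontrol-style output into a dict."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


# Fields copied from "scontrol show node apollo" and "scontrol show job 879".
node = parse_kv("CPUAlloc=28 CPUTot=36 CPULoad=7.02")
job = parse_kv("NumNodes=1-1 NumCPUs=4 NumTasks=4")


def free_cpus(node):
    """CPUs not currently allocated on the node."""
    return int(node["CPUTot"]) - int(node["CPUAlloc"])


# True when the job's CPU request fits into the unallocated CPUs.
fits = free_cpus(node) >= int(job["NumCPUs"])
print(free_cpus(node), fits)
```
</pre>
        <p>With the quoted values, 8 CPUs are unallocated and job 879
          asks for 4, so the CPU count alone does not explain the
          pending state.</p>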
        <p>Some information to add:</p>
        <p>I have killed some jobs with kill instead of scancel. <span
            class="jlqj4b"><span lang="EN">Could this be the cause of
              this result?</span></span><span lang="EN"> </span></p>
        <p class="MsoNormal"><o:p> </o:p></p>
      </div>
    </blockquote>
  </body>
</html>