<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">On 2021/10/12 21:21, Adam Xu wrote:<br>
</div>
<blockquote type="cite"
cite="mid:ae6124e8-cef5-bfcf-9c82-4f0c14257826@adagene.com.cn">Hi
All,
<br>
<br>
OS: Rocky Linux 8.4
<br>
<br>
slurm version: 20.11.7
<br>
<br>
The partition is named apollo, and so is its only node. The node
has 36 CPU cores and 8 GPUs.
<br>
<br>
partition info
<br>
<br>
$ scontrol show partition apollo
<br>
PartitionName=apollo
<br>
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
<br>
AllocNodes=ALL Default=NO QoS=N/A
<br>
DefaultTime=NONE DisableRootJobs=NO ExclusiveUser=NO
GraceTime=0 Hidden=NO
<br>
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO
MaxCPUsPerNode=UNLIMITED
<br>
Nodes=apollo
<br>
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO
OverSubscribe=YES:36
<br>
OverTimeLimit=NONE PreemptMode=OFF
<br>
State=UP TotalCPUs=36 TotalNodes=1 SelectTypeParameters=NONE
<br>
JobDefaults=(null)
<br>
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
<br>
<br>
node info
<br>
<br>
$ scontrol show node apollo
<br>
NodeName=apollo Arch=x86_64 CoresPerSocket=18
<br>
CPUAlloc=28 CPUTot=36 CPULoad=7.02
<br>
AvailableFeatures=(null)
<br>
ActiveFeatures=(null)
<br>
Gres=gpu:v100:8,mps:v100:800
<br>
NodeAddr=apollo NodeHostName=apollo Version=20.11.7
<br>
OS=Linux 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Wed Sep 15
19:12:32 UTC 2021
<br>
RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1
<br>
State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
MCS_label=N/A
<br>
Partitions=apollo
<br>
BootTime=2021-09-20T23:43:49
SlurmdStartTime=2021-10-12T16:55:44
<br>
CfgTRES=cpu=36,mem=1M,billing=36
<br>
AllocTRES=cpu=28
<br>
CapWatts=n/a
<br>
CurrentWatts=0 AveWatts=0
<br>
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
<br>
Comment=(null)
<br>
<br>
I now have 7 jobs running, but when I submit an 8th job, it stays
pending with reason Resources.
<br>
<br>
$ squeue
<br>
JOBID PARTITION NAME USER ST TIME NODES
NODELIST(REASON)
<br>
879 apollo do.sh zhining_ PD 0:00 1
(Resources)
<br>
489 apollo do.sh zhining_ R 13-12:50:45 1
apollo
<br>
490 apollo do.sh zhining_ R 13-12:41:00 1
apollo
<br>
592 apollo runme-gp junwen_f R 4-12:42:31 1
apollo
<br>
751 apollo runme-gp junwen_f R 1-12:48:20 1
apollo
<br>
752 apollo runme-gp junwen_f R 1-12:48:10 1
apollo
<br>
871 apollo runme-gp junwen_f R 7:13:45 1
apollo
<br>
872 apollo runme-gp junwen_f R 7:12:42 1
apollo
<br>
<br>
$ scontrol show job 879
<br>
JobId=879 JobName=do.sh
<br>
UserId=zhining_wan(1001) GroupId=zhining_wan(1001)
MCS_label=N/A
<br>
Priority=4294900882 Nice=0 Account=(null) QOS=(null)
<br>
JobState=PENDING Reason=Resources Dependency=(null)
<br>
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
<br>
RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
<br>
SubmitTime=2021-10-12T16:29:29 EligibleTime=2021-10-12T16:29:29
<br>
AccrueTime=2021-10-12T16:29:29
<br>
StartTime=2021-10-12T21:17:41 EndTime=Unknown Deadline=N/A
<br>
SuspendTime=None SecsPreSuspend=0
LastSchedEval=2021-10-12T21:17:39
<br>
Partition=apollo AllocNode:Sid=sms:1281191
<br>
ReqNodeList=(null) ExcNodeList=(null)
<br>
NodeList=(null) SchedNodeList=apollo
<br>
NumNodes=1-1 NumCPUs=4 NumTasks=4 CPUs/Task=1
ReqB:S:C:T=0:0:*:*
<br>
TRES=cpu=4,node=1,billing=4
<br>
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
<br>
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
<br>
Features=(null) DelayBoot=00:00:00
<br>
OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
<br>
Command=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/do.sh
<br>
WorkDir=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test
<br>
StdErr=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/slurm-879.out
<br>
StdIn=/dev/null
<br>
StdOut=/home/zhining_wan/job/2021/20210603_ctla4_double_bilayer/final_pdb_minimize/amber/nolipid/test/slurm-879.out
<br>
Power=
<br>
TresPerNode=gpu:1
<br>
NtasksPerTRES:0
<br>
<br>
With 7 jobs running, the node still has 8 CPU cores and 1 GPU free,
so the remaining resources should be sufficient. Why is the job
pending with reason "Resources"?
<br>
</blockquote>
<p>Some information to add:</p>
<p>I killed some jobs with kill instead of scancel. Could this be
the cause of this result?</p>
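<p>For reference, the arithmetic behind the "8 CPU cores left" figure can be
checked with a small script. This is only a sketch: the helper names field,
free_cpus and total_gpus are made up for illustration, and the input strings
are excerpts copied from the scontrol output quoted in this message. It does
not query Slurm, and it says nothing about which GPUs Slurm itself considers
allocated.</p>
<pre>
```python
import re

# Excerpts copied from the `scontrol show node apollo` and
# `scontrol show job 879` output quoted in this message.
NODE_TEXT = (
    "NodeName=apollo Arch=x86_64 CoresPerSocket=18 "
    "CPUAlloc=28 CPUTot=36 CPULoad=7.02 "
    "Gres=gpu:v100:8,mps:v100:800"
)
JOB_TEXT = "JobId=879 NumNodes=1-1 NumCPUs=4 NumTasks=4 TresPerNode=gpu:1"


def field(text, key):
    """Return the value of the first KEY=VALUE token in scontrol-style text."""
    match = re.search(r"(?:^|\s)" + re.escape(key) + r"=(\S+)", text)
    if match is None:
        raise KeyError(key)
    return match.group(1)


def free_cpus(node_text):
    """CPUs not yet allocated on the node: CPUTot minus CPUAlloc."""
    return int(field(node_text, "CPUTot")) - int(field(node_text, "CPUAlloc"))


def total_gpus(node_text):
    """Configured GPU count, read from the gpu entry of the Gres field."""
    for entry in field(node_text, "Gres").split(","):
        parts = entry.split(":")
        if parts[0] == "gpu":
            return int(parts[-1])
    return 0


if __name__ == "__main__":
    print("free CPUs:", free_cpus(NODE_TEXT))
    print("CPUs requested by job 879:", field(JOB_TEXT, "NumCPUs"))
    print("configured GPUs:", total_gpus(NODE_TEXT))
```
</pre>
<p>On the numbers above, the CPU side leaves room for the 4 CPUs that job 879
requests. The script only reports the configured GPU count, not how many GPUs
are currently in use.</p>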
</body>
</html>