[slurm-users] scontrol for a heterogeneous job appears incorrect
Jeffrey R. Lang
JRLang at uwyo.edu
Wed Apr 24 15:24:00 UTC 2019
Chris
Upon further testing this morning, I see that the job is assigned two different job IDs, something I wasn't expecting. That led me down the road of thinking the output was incorrect.
scontrol on a heterogeneous (pack) job will show multiple job IDs for the job, so the output just wasn't what I was expecting.
Jeff
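For reference, the submission script was shaped roughly like the sketch below. This is not the actual check_nodes.sbatch (only its name appears in the output), and the partition names, node counts, and task counts are simply inferred from the allocation shown further down. The "#SBATCH packjob" separator and the srun --pack-group option are the heterogeneous-job syntax Slurm used at the time (later renamed hetjob / --het-group).

#!/bin/bash
#SBATCH --job-name=CHECK_NODE
#SBATCH --account=arcc
#SBATCH --time=01:00:00
# First pack component: a single task on a hugemem node
#SBATCH --partition=teton-hugemem
#SBATCH --ntasks=1
#SBATCH packjob
# Second pack component: 9 regular nodes, 32 tasks per node
#SBATCH --partition=teton
#SBATCH --nodes=9
#SBATCH --ntasks-per-node=32

# Launch a step spanning both pack components
srun --pack-group=0,1 hostname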
[jrlang at tlog1 TEST_CODE]$ sbatch check_nodes.sbatch
Submitted batch job 2611773
[jrlang at tlog1 TEST_CODE]$ squeue | grep jrlang
2611773+1 teton CHECK_NO jrlang R 0:10 9 t[439-447]
2611773+0 teton-hug CHECK_NO jrlang R 0:10 1 thm03
[jrlang at tlog1 TEST_CODE]$ pestat | grep jrlang
t439 teton alloc 32 32 0.02* 128000 119594 2611774 jrlang
t440 teton alloc 32 32 0.02* 128000 119542 2611774 jrlang
t441 teton alloc 32 32 0.01* 128000 119760 2611774 jrlang
t442 teton alloc 32 32 0.01* 128000 121491 2611774 jrlang
t443 teton alloc 32 32 0.02* 128000 119893 2611774 jrlang
t444 teton alloc 32 32 0.02* 128000 119607 2611774 jrlang
t445 teton alloc 32 32 0.03* 128000 119626 2611774 jrlang
t446 teton alloc 32 32 0.01* 128000 119882 2611774 jrlang
t447 teton alloc 32 32 0.01* 128000 120037 2611774 jrlang
thm03 teton-hugemem mix 1 32 0.01* 1024000 1017845 2611773 jrlang
[jrlang at tlog1 TEST_CODE]$ scontrol show job 2611773
JobId=2611773 PackJobId=2611773 PackJobOffset=0 JobName=CHECK_NODE
PackJobIdSet=2611773-2611774
UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
Priority=1004 Nice=0 Account=arcc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
AccrueTime=2019-04-24T09:03:00
StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-24T09:03:20
Partition=teton-hugemem AllocNode:Sid=tlog1:24498
ReqNodeList=(null) ExcNodeList=(null)
NodeList=thm03
BatchHost=thm03
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=1000M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
StdIn=/dev/null
StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
Power=
JobId=2611774 PackJobId=2611773 PackJobOffset=1 JobName=CHECK_NODE
PackJobIdSet=2611773-2611774
UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
Priority=1086 Nice=0 Account=arcc QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
AccrueTime=2019-04-24T09:03:00
StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2019-04-24T09:03:20
Partition=teton AllocNode:Sid=tlog1:24498
ReqNodeList=(null) ExcNodeList=(null)
NodeList=t[439-447]
BatchHost=t439
NumNodes=9 NumCPUs=288 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=288,mem=288000M,node=9,billing=288
Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
MinCPUsNode=32 MinMemoryCPU=1000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
StdIn=/dev/null
StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
Power=
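As a side note, a single pack component can also be addressed directly with the jobid+offset form that squeue displays, e.g. something like

scontrol show job 2611773+1

which should print just that component's record rather than every record in the pack.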
-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Tuesday, April 23, 2019 7:39 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] scontrol for a heterogeneous job appears incorrect
On 23/4/19 3:02 pm, Jeffrey R. Lang wrote:
> Looking at the nodelist and the NumNodes, they both appear incorrect. They
> should show the first node and then the additional nodes assigned.
You're only looking at the second of the two pack jobs for your submission; could they be assigned in the first of the pack jobs instead?
All the best,
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA