[slurm-users] scontrol for a heterogenous job appears incorrect

Jeffrey R. Lang JRLang at uwyo.edu
Wed Apr 24 15:24:00 UTC 2019


Chris

Upon further testing this morning I see the job is assigned two different job IDs, something I wasn't expecting.  That led me down the road of thinking the output was incorrect.

Scontrol on a heterogeneous (pack) job shows multiple job IDs, one per pack component, so the output just wasn't what I was expecting.
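
For reference, the individual pack components can also be addressed by job ID plus component offset (syntax per the Slurm heterogeneous job documentation for this release; not verified here):

scontrol show job 2611773+0      # first component  (thm03, teton-hugemem)
scontrol show job 2611773+1      # second component (t[439-447], teton)
squeue --job=2611773             # lists both components as 2611773+0 and 2611773+1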

Jeff

[jrlang at tlog1 TEST_CODE]$ sbatch check_nodes.sbatch
Submitted batch job 2611773
[jrlang at tlog1 TEST_CODE]$ squeue | grep jrlang
         2611773+1     teton CHECK_NO   jrlang  R       0:10      9 t[439-447]
         2611773+0 teton-hug CHECK_NO   jrlang  R       0:10      1 thm03
[jrlang at tlog1 TEST_CODE]$ pestat | grep jrlang
    t439           teton    alloc  32  32    0.02*   128000   119594  2611774 jrlang  
    t440           teton    alloc  32  32    0.02*   128000   119542  2611774 jrlang  
    t441           teton    alloc  32  32    0.01*   128000   119760  2611774 jrlang  
    t442           teton    alloc  32  32    0.01*   128000   121491  2611774 jrlang  
    t443           teton    alloc  32  32    0.02*   128000   119893  2611774 jrlang  
    t444           teton    alloc  32  32    0.02*   128000   119607  2611774 jrlang  
    t445           teton    alloc  32  32    0.03*   128000   119626  2611774 jrlang  
    t446           teton    alloc  32  32    0.01*   128000   119882  2611774 jrlang  
    t447           teton    alloc  32  32    0.01*   128000   120037  2611774 jrlang  
   thm03   teton-hugemem      mix   1  32    0.01*  1024000  1017845  2611773 jrlang  
[jrlang at tlog1 TEST_CODE]$ scontrol show job 2611773
JobId=2611773 PackJobId=2611773 PackJobOffset=0 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1004 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton-hugemem AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=thm03
   BatchHost=thm03
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1000M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611773.out
   Power=

JobId=2611774 PackJobId=2611773 PackJobOffset=1 JobName=CHECK_NODE
   PackJobIdSet=2611773-2611774
   UserId=jrlang(10024903) GroupId=jrlang(10024903) MCS_label=N/A
   Priority=1086 Nice=0 Account=arcc QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:01:59 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=2019-04-24T09:03:00 EligibleTime=2019-04-24T09:03:00
   AccrueTime=2019-04-24T09:03:00
   StartTime=2019-04-24T09:03:20 EndTime=2019-04-24T10:03:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-04-24T09:03:20
   Partition=teton AllocNode:Sid=tlog1:24498
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=t[439-447]
   BatchHost=t439
   NumNodes=9 NumCPUs=288 NumTasks=288 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=288,mem=288000M,node=9,billing=288
   Socks/Node=* NtasksPerN:B:S:C=32:0:*:* CoreSpec=*
   MinCPUsNode=32 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/pfs/tsfs1/home/jrlang/TEST_CODE/check_nodes.sbatch
   WorkDir=/pfs/tsfs1/home/jrlang/TEST_CODE
   StdErr=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   StdIn=/dev/null
   StdOut=/pfs/tsfs1/home/jrlang/TEST_CODE/slurm-2611774.out
   Power=
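
For completeness, check_nodes.sbatch itself isn't shown above.  A minimal sketch of a two-component pack-job submission that would produce records like those above (partitions, node counts and per-CPU memory taken from the scontrol output; the real script almost certainly differs) is:

#!/bin/bash
# Component 0: a single task on the hugemem partition
#SBATCH --job-name=CHECK_NODE
#SBATCH --partition=teton-hugemem
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1000M
#SBATCH --time=01:00:00
#SBATCH packjob
# Component 1: nine 32-core nodes on the regular partition
#SBATCH --partition=teton
#SBATCH --nodes=9
#SBATCH --ntasks-per-node=32
#SBATCH --mem-per-cpu=1000M

# Launch one step across both pack components; --pack-group is the option
# name in the Slurm releases that report PackJobId (newer ones use --het-group)
srun --pack-group=0,1 hostname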


-----Original Message-----
From: slurm-users <slurm-users-bounces at lists.schedmd.com> On Behalf Of Chris Samuel
Sent: Tuesday, April 23, 2019 7:39 PM
To: slurm-users at lists.schedmd.com
Subject: Re: [slurm-users] scontrol for a heterogenous job appears incorrect



On 23/4/19 3:02 pm, Jeffrey R. Lang wrote:

> Looking at the nodelist and the NumNodes they are both incorrect.   They
> should show the first node and then the additional nodes assigned.

You're only looking at the second of the two pack jobs for your submission; could they be assigned in the first of the pack jobs instead?

All the best,
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA


