[slurm-users] Information about finished jobs

Paul Raines raines at nmr.mgh.harvard.edu
Mon Jun 14 18:12:17 UTC 2021


I have been writing my own 'jobinfo' tool so that users can see
useful, readable info on a job in any state.  I am still new to
Slurm and trying to wrap my head around the database info and
the effects of arrays and such.

A completed job output looks like this:

# jobinfo 357300
--------------------------------------------------
           JobID : 356847_361          | 356847_361.batch
         JobName : batch_compile_loraks.sh
            User : gr879
         Account : syhdiff
       Partition : basic
         ReqTRES : billing=1,cpu=1,mem=40G,node=1
       AllocTRES : billing=1,cpu=1,mem=40G,node=1
        NodeList : r440-19
          Submit : 2021-06-13T22:07:07
           Start : 2021-06-14T01:47:55 | 2021-06-14T01:47:55
             End : 2021-06-14T05:22:00 | 2021-06-14T05:22:00
       Timelimit : 2-00:00:00
         Elapsed : 03:34:05            | 03:34:05
         CPUTime : 03:34:05            | 03:34:05
       SystemCPU : 05:57.056           | 05:57.056
         UserCPU : 03:27:07            | 03:27:07
        TotalCPU : 03:33:04            | 03:33:04
     MaxDiskRead :                     | 109.25M
    MaxDiskWrite :                     | 1.08M
          MaxRSS :                     | 32529204K
       MaxVMSize :                     | 61834112K
           State : COMPLETED           | COMPLETED
        ExitCode : 0:0                 | 0:0
         WorkDir : /autofs/homes/002/gr879/matlab/ex_vivo/batch_code

and a typical RUNNING job looks like this:

# jobinfo 357304
--------------------------------------------------
           JobID : 357199_21           | 357199_21.batch
         JobName : batch_compile_multi_shell.sh
            User : gr879
         Account : syhdiff
       Partition : basic
         ReqTRES : billing=1,cpu=1,mem=12G,node=1
       AllocTRES : billing=1,cpu=1,mem=12G,node=1
        NodeList : r440-17
          Submit : 2021-06-14T00:31:11
           Start : 2021-06-14T01:47:55 | 2021-06-14T01:47:55
             End : Unknown             | Unknown
       Timelimit : 1-00:00:00
         Elapsed : 12:04:35            | 12:04:35
         CPUTime : 12:04:35            | 12:04:35
       SystemCPU : 00:00:00            | 00:00:00
         UserCPU : 00:00:00            | 00:00:00
        TotalCPU : 00:00:00            | 12:01:46
     MaxDiskRead :                     | 101176763
    MaxDiskWrite :                     | 1259187
          MaxRSS :                     | 5455M
       MaxVMSize :                     | 10823600K
           State : RUNNING             | RUNNING
        ExitCode : 0:0                 | 0:0
         WorkDir : /autofs/homes/002/gr879/matlab/ex_vivo/batch_code

Unfortunately, I have to show zeros for the fields I cannot
get yet.  My current issue is with the TotalCPU row on running
jobs.  I actually take that value from AveCPU as reported by sstat,
and in the case above it looks right.  But in others it is way off:

# jobinfo 357305
--------------------------------------------------
           JobID : 357305              | 357305.batch
         JobName : sjob_185
            User : mjk2
         Account : circgp
       Partition : basic
         ReqTRES : billing=27,cpu=20,mem=370G,node=1
       AllocTRES : billing=27,cpu=20,mem=370G,node=1
        NodeList : r440-05
          Submit : 2021-06-14T01:44:56
           Start : 2021-06-14T05:02:10 | 2021-06-14T05:02:10
             End : Unknown             | Unknown
       Timelimit : 7-00:00:00
         Elapsed : 08:50:17            | 08:50:17
         CPUTime : 7-08:45:40          | 7-08:45:40
       SystemCPU : 00:00:00            | 00:00:00
         UserCPU : 00:00:00            | 00:00:00
        TotalCPU : 00:00:00            | 11:33.000
     MaxDiskRead :                     | 79699046
    MaxDiskWrite :                     | 17983
          MaxRSS :                     | 81357340K
       MaxVMSize :                     | 104992372K
           State : RUNNING             | RUNNING
        ExitCode : 0:0                 | 0:0
         WorkDir : /autofs/homes/002/mjk2

In this job the user asked for 20 cores, but I can see his
job is only on one core on the actual node, so this is a big waste.
But that core is constantly going at 100%, so I would expect AveCPU
to be close to the Elapsed time, but it is way less (11 minutes
instead of nearly 9 hours).
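
To put numbers on that gap, here is a minimal sketch (not the actual
jobinfo code) that converts the duration strings shown above to seconds
and compares them.  The helper name duration_to_seconds and the assumed
layout [days-][hours:]minutes:seconds[.fraction] are illustrative
assumptions based only on the values in this output.

```python
import re

# Assumed layout of the duration strings seen in the output above:
# [days-][hours:]minutes:seconds[.fraction]
_DURATION = re.compile(
    r"^(?:(?P<days>\d+)-)?(?:(?P<hours>\d+):)?"
    r"(?P<minutes>\d+):(?P<seconds>\d+(?:\.\d+)?)$"
)


def duration_to_seconds(text):
    """Convert a string such as '11:33.000' or '7-08:45:40' to seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days = int(match["days"] or 0)
    hours = int(match["hours"] or 0)
    minutes = int(match["minutes"])
    seconds = float(match["seconds"])
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Values copied from the jobinfo output for job 357305.
elapsed = duration_to_seconds("08:50:17")
cpu_time = duration_to_seconds("7-08:45:40")
ave_cpu = duration_to_seconds("11:33.000")

# CPUTime divided by Elapsed recovers the number of allocated CPUs.
cpus_implied = cpu_time / elapsed

# Fraction of one core's elapsed time that AveCPU accounts for.
one_core_fraction = ave_cpu / elapsed

print(f"CPUs implied by CPUTime/Elapsed: {cpus_implied:.0f}")
print(f"AveCPU / Elapsed: {one_core_fraction:.1%}")
```

With these numbers, CPUTime is exactly 20 times Elapsed, matching the
cpu=20 allocation, while AveCPU covers only about 2% of the elapsed
time of a single core.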

# /usr/bin/sstat -p -a --job=357305 --format=JobID,AveCPU
JobID|AveCPU|
357305.extern|213503982334-14:25:51|
357305.batch|11:33.000|

Any idea why this is?  Also, what is that crazy number for
AveCPU on 357305.extern?
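
One thing that can be checked with plain arithmetic, offered as an
observation rather than an explanation: converted to whole seconds, the
.extern value equals 2**64 // 1000.  The sketch below parses the sstat
output shown above and does that conversion.  The helper names
parse_sstat and whole_seconds are hypothetical, and reading this as an
unset 64-bit counter expressed in milliseconds is only a guess.

```python
def parse_sstat(output):
    """Map step ID to AveCPU from pipe-delimited `sstat -p` output."""
    lines = [line for line in output.strip().splitlines() if line]
    header = lines[0].rstrip("|").split("|")
    rows = {}
    for line in lines[1:]:
        fields = dict(zip(header, line.rstrip("|").split("|")))
        rows[fields["JobID"]] = fields["AveCPU"]
    return rows


def whole_seconds(duration):
    """Integer seconds in a [days-][hours:]minutes:seconds[.fraction]
    string.  The fractional part is dropped so that very large values
    stay exact (no floating point involved)."""
    days, _, clock = duration.rpartition("-")
    parts = [int(part.split(".")[0]) for part in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((int(days or 0) * 24 + hours) * 60 + minutes) * 60 + seconds


# Output copied verbatim from the sstat call above.
sample = """JobID|AveCPU|
357305.extern|213503982334-14:25:51|
357305.batch|11:33.000|
"""

ave_cpu_by_step = parse_sstat(sample)
extern_seconds = whole_seconds(ave_cpu_by_step["357305.extern"])
batch_seconds = whole_seconds(ave_cpu_by_step["357305.batch"])

print(extern_seconds)
print(extern_seconds == 2**64 // 1000)
```

If that reading is right, the .extern figure is not a real measurement
at all, so a tool like jobinfo could reasonably ignore any step whose
AveCPU exceeds the job's Elapsed time.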

-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Mon, 14 Jun 2021 2:45am, Ole Holm Nielsen wrote:

> On 6/14/21 8:26 AM, Gestió Servidors wrote:
>>  How can I get all information about a finished job in the same way as
>>  “scontrol show jobid=” when job is pending or running?
>
> Some minutes after job completion, you can only get the information which is 
> stored in the Slurm database.
>
> My script "showjob" in 
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/jobs shows all 
> available information for jobs in the queue as well as in the database.
>
> /Ole
>
>
>
>

