[slurm-users] estimate queue time using 'sbatch --test-only'

Feng Li li2251 at purdue.edu
Wed Sep 15 20:13:25 UTC 2021


Hi and thanks for reading this!

I am trying to estimate the queue time of a job of a given size and walltime limit. Our project considers multiple HPC resources and needs estimated queue times to decide where to actually submit the job.

From the man page of 'sbatch', I found that the "--test-only" option can be used to "validate the batch script and return an estimate of when a job would be scheduled to run given the current job queue and all the other arguments specifying the job requirements". This looks very promising to us.
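For reference, here is a minimal sketch of how the estimate could be captured from a script. It assumes the "to start at <time>" message format shown in the example at the end of this mail, and relies on sbatch printing the estimate to stderr:

-----begin of sketch
# Run a --test-only submission and extract the estimated start time.
# sbatch writes the "... to start at <time> ..." message to stderr,
# hence the 2>&1 redirect before parsing.
est=$(sbatch --test-only -n 48 -N 1 -p skx-normal -t 00:10:00 --wrap "hostname" 2>&1 \
      | sed -n 's/.* to start at \([^ ]*\).*/\1/p')
echo "estimated start: $est"    # e.g. 2021-09-15T15:11:49
-----end of sketch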

I tried several launches on IU BigRed3 and TACC Stampede2; the recorded results are shown below. The last two columns are the estimated queue time (estimated_start minus the timestamp of the estimate) and the actual queue time (actual_start minus submit_time). From the results, the estimate looks quite inaccurate and can err in either direction (over- or under-estimated):

-----start of output
site       slurm version  partition   JobID    node  np  walltime_mins  timestamp_estimate  estimated_start  submit_time      actual_start     estimated_wait  actual_wait
Stampede2  18.08.5-2      skx-normal  8436162  1     48  10             9/9/2021 16:05      9/11/2021 23:29  9/9/2021 16:08   9/9/2021 16:11   55:23:56        0:02:49
Stampede2  18.08.5-2      skx-normal  8436369  1     48  10             9/9/2021 16:51      9/12/2021 0:04   9/9/2021 16:51   9/9/2021 16:52   55:13:00        0:00:58
Stampede2  18.08.5-2      normal      8436193  1     48  10             9/9/2021 16:17      9/9/2021 18:02   9/9/2021 16:19   9/9/2021 16:19   1:45:26         0:00:02
Stampede2  18.08.5-2      normal      8436308  2     48  10             9/9/2021 16:40      9/9/2021 18:25   9/9/2021 16:41   9/9/2021 16:41   1:45:00         0:00:04
Bigred3    20.11.7        general     1727144  1     24  10             9/9/2021 17:57      9/10/2021 12:39  9/9/2021 17:59   9/9/2021 17:59   18:42:00        0:00:00
Bigred3    20.11.7        general     1734075  1     24  60             9/15/2021 14:54     9/15/2021 14:54  9/15/2021 14:54  9/15/2021 15:01  0:00:00         0:07:11
Bigred3    20.11.7        general     1734079  1     24  20             9/15/2021 15:09     9/15/2021 15:09  9/15/2021 15:09  9/15/2021 15:09  0:00:00         0:00:01
Bigred3    20.11.7        general     1734081  4     24  60             9/15/2021 15:11     9/15/2021 15:11  9/15/2021 15:11  9/15/2021 15:34  0:00:00         0:22:15
-----end of output

Could you suggest better ways to estimate the queue time? Or are there specific configurations/situations on those systems that might affect the queue time estimation (e.g., fair-share or site-specific QOS settings)?

Below is an example of how I took the measurements, for your information:

-----begin of example
lifen@elogin1(:):~$date && sbatch --test-only -n 24 -N 4 -p general -t 00:60:00 --wrap "hostname"
Wed Sep 15 15:11:49 EDT 2021
sbatch: Job 1734080 to start at 2021-09-15T15:11:49 using 24 processors on nodes nid00[935-938] in partition general
lifen@elogin1(:):~$date && sbatch -n 24 -N 4 -p general -t 00:60:00 --wrap "hostname"
Wed Sep 15 15:11:58 EDT 2021
Submitted batch job 1734081
lifen@elogin1(:):~$sacct --format=User,JobID,Jobname,partition,state,time,start,end,elapsed,MaxRss,MaxVMSize,nnodes,ncpus,nodelist -j 1734081
     User        JobID    JobName  Partition      State  Timelimit               Start                 End    Elapsed     MaxRSS  MaxVMSize   NNodes      NCPUS        NodeList
--------- ------------ ---------- ---------- ---------- ---------- ------------------- ------------------- ---------- ---------- ---------- -------- ---------- ---------------
    lifen 1734081            wrap    general  COMPLETED   01:00:00 2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00                              4         24 nid00[169,883,+
          1734081.bat+      batch             COMPLETED            2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00      2136K    226420K        1         18        nid00169
          1734081.ext+     extern             COMPLETED            2021-09-15T15:34:13 2021-09-15T15:34:13   00:00:00         4K         4K        4         24 nid00[169,883,+
-----end of example
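For the record, the actual wait column in the table above is just the difference between sacct's Submit and Start fields; a quick way to pull both for the same job:

-----begin of sketch
# Allocation-level Submit and Start timestamps (-X skips job steps).
# actual wait = Start - Submit; for 1734081 that is
# 15:34:13 - 15:11:58, i.e. roughly the 0:22:15 in the table.
sacct -j 1734081 -X --noheader --format=Submit,Start
-----end of sketch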

Thanks,
Feng Li