[slurm-users] sacct runtime performance varies by job status code
John Snowdon
John.Snowdon at newcastle.ac.uk
Fri Sep 1 07:59:57 UTC 2023
Hi,
I am attempting to pull some historical information from our HPC system to analyse trends in our users' workloads over time.
As part of this I am using sacct to run a number of queries for different job statuses (running, pending, completed, 'other') over particular time periods (hourly, daily, etc.).
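The harness boils down to loops along these lines (a simplified sketch; the real script covers more states and longer periods than this single day of hourly PD counts):

$ for h in $(seq -w 0 23); do
>   echo -n "2023-09-01 ${h}:00 PD: "
>   sacct -X -n -p -a -S 2023-09-01T${h}:00:00 -E 2023-09-01T${h}:59:59 --state=PD | wc -l
> done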
I have noticed that most of my sacct queries return on the order of a few hundred milliseconds, regardless of the number of rows returned (anywhere from none to several thousand).
However, two distinct job status codes result in a huge delay of anywhere from 30 seconds to over a minute, irrespective of the number of rows returned.
Queries for any job status code in R,CD,CA,DL,F,NF,PR,RS,RV,OOM,TO return quickly, but PD and S are inordinately slow. Examples:
# Jobs in running state:
$ time sacct -X -v -p -a -S 2023-09-01T00:00:00 -E 2023-09-01T00:59:59 --state=R | wc -l
sacct: Jobs RUNNING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
281
real 0m0.095s
user 0m0.032s
sys 0m0.012s
# Jobs with an 'abnormal' state:
$ time sacct -X -v -p -a -S 2023-09-01T00:00:00 -E 2023-09-01T00:59:59 --state=CA,DL,F,NF,PR,RS,RV,OOM,TO | wc -l
sacct: Jobs CANCELLED,DEADLINE,FAILED,NODE_FAIL,PREEMPTED,RESIZING,REVOKED,OUT_OF_MEMORY,TIMEOUT in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
132
real 0m0.088s
user 0m0.033s
sys 0m0.014s
... but looking at pending or suspended job states:
$ time sacct -X -v -p -a -S 2023-09-01T00:00:00 -E 2023-09-01T00:59:59 --state=PD | wc -l
sacct: Jobs PENDING in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
2000
real 0m45.712s
user 0m0.041s
sys 0m0.013s
$ time sacct -X -v -p -a -S 2023-09-01T00:00:00 -E 2023-09-01T00:59:59 --state=S | wc -l
sacct: Jobs SUSPENDED in the time window from Fri Sep 01 00:00:00 2023 to Fri Sep 01 00:59:59 2023
sacct: accounting_storage/slurmdbd: init: Accounting storage SLURMDBD plugin loaded
1
real 1m20.490s
user 0m0.033s
sys 0m0.006s
Our sacct version reports:
$ sacct -V
slurm 20.11.8
The current performance makes my efforts to analyse the size of the tail of pending jobs (one of the criteria we want to use to understand whether we are coping with user submission demand) impractical: querying pending jobs appears to be more than 100x slower than querying which jobs were running at any point in time.
Some things I've observed:
- Whether I use explicit start/end times or the default time window makes no difference
- The size of the time window set by start/end makes no difference
- Querying a list of status codes versus a single state makes no difference: any single code or list of codes excluding PD and S is fast (timing each code individually, as in the loop below, shows the same pattern)
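To confirm that last point, I timed each state code individually with a loop along these lines (same one-hour window as the examples above):

$ for s in R CD CA DL F NF PR RS RV OOM TO PD S; do
>   echo "--state=$s"
>   time sacct -X -n -p -a -S 2023-09-01T00:00:00 -E 2023-09-01T00:59:59 --state=$s > /dev/null
> done

Every code except PD and S comes back in a fraction of a second; PD and S consistently take tens of seconds.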
Is this likely to be behaviour of the sacct client, or is there a fundamental difference in how PD and S jobs are stored in the database schema that would make those queries orders of magnitude slower?
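I haven't dug into the accounting database directly yet. If the schema is the culprit, I assume a query like the following (database and table names here are the slurmdbd defaults, with <cluster> standing in for the real cluster name) would show whether the state and time columns are indexed:

$ mysql -u slurm -p slurm_acct_db -e 'SHOW INDEX FROM <cluster>_job_table'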
John Snowdon
Advanced Computing Consultant
Newcastle University IT Service
The Elizabeth Barraclough Building
91 Sandyford Road
Newcastle upon Tyne,
NE1 8HW