[slurm-users] Poor scheduler performance with moderate number of jobs
Kevin M. Hildebrand
kevin at umd.edu
Mon Jun 11 08:10:15 MDT 2018
We're seeing some pretty bad performance with around 3000 jobs in the queue.
We're using sched/backfill, and I've been tweaking the bf_ parameters
to try to improve things, with limited results.
But even before the backfill process starts, the main scheduling loop
is taking so long per job that it's not even completing the scheduling
of a single job before timing out.
From watching the logs, when backfill does run it's taking 17 seconds
or more to test each job, so it doesn't get very far either.
One thing of note: most of the 3000 jobs are serial jobs in our
scavenger queue, which is preemptible.
We're currently running SLURM version 16.05.1, a bit behind, I know.
Any pointers on what to look at would be appreciated.
Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland
Division of IT
-------------- next part --------------
*******************************************************
sdiag output at Mon Jun 11 10:09:17 2018
Data since Mon Jun 11 09:49:18 2018
*******************************************************
Server thread count: 7
Agent queue size: 0
Jobs submitted: 7
Jobs started: 1
Jobs completed: 3
Jobs canceled: 3
Jobs failed: 0
Main schedule statistics (microseconds):
Last cycle: 20017144
Max cycle: 20897866
Total cycles: 41
Mean cycle: 18957323
Mean depth cycle: 1
Cycles per minute: 2
Last queue length: 11
Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
Total backfilled jobs (since last slurm start): 1
Total backfilled jobs (since last stats cycle start): 1
Total cycles: 2
Last cycle when: Mon Jun 11 10:07:15 2018
Last cycle: 224511653
Max cycle: 224511653
Mean cycle: 213405214
Last depth cycle: 3
Last depth cycle (try sched): 3
Depth Mean: 4
Depth Mean (try depth): 4
Last queue length: 11
Queue length mean: 12
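For readers who find the microsecond figures hard to scan, here is a minimal Python sketch that converts the cycle times above to seconds. The values are copied from the sdiag output in this message; the helper names (to_seconds, summarize) are hypothetical and exist only for illustration.

```python
# Cycle times copied from the sdiag output above, in microseconds.
MAIN_CYCLE_US = {"last": 20017144, "max": 20897866, "mean": 18957323}
BACKFILL_CYCLE_US = {"last": 224511653, "max": 224511653, "mean": 213405214}


def to_seconds(microseconds: int) -> float:
    """Convert a duration in microseconds to seconds."""
    return microseconds / 1_000_000


def summarize(label: str, stats: dict) -> str:
    """Format one group of cycle statistics in seconds, one decimal place."""
    parts = ", ".join(f"{key}={to_seconds(value):.1f}s" for key, value in stats.items())
    return f"{label}: {parts}"


print(summarize("main scheduler cycle", MAIN_CYCLE_US))
print(summarize("backfill cycle", BACKFILL_CYCLE_US))
```

Read this way, each main scheduling cycle takes roughly 19 to 21 seconds, and each backfill cycle takes well over three minutes.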
Remote Procedure Call statistics by message type
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:843 ave_time:4903158 total_time:4133362640
REQUEST_PARTITION_INFO ( 2009) count:98 ave_time:3133 total_time:307125
REQUEST_JOB_INFO ( 2003) count:94 ave_time:14811231 total_time:1392255803
REQUEST_NODE_INFO ( 2007) count:86 ave_time:9364554 total_time:805351683
REQUEST_NODE_INFO_SINGLE ( 2040) count:61 ave_time:13399191 total_time:817350682
REQUEST_UPDATE_NODE ( 3002) count:42 ave_time:2736748 total_time:114943454
REQUEST_JOB_STEP_INFO ( 2005) count:38 ave_time:1480198 total_time:56247529
MESSAGE_EPILOG_COMPLETE ( 6012) count:31 ave_time:16408225 total_time:508655003
REQUEST_JOB_INFO_SINGLE ( 2021) count:28 ave_time:10754115 total_time:301115228
REQUEST_COMPLETE_PROLOG ( 6018) count:20 ave_time:18418352 total_time:368367040
REQUEST_SUBMIT_BATCH_JOB ( 4003) count:7 ave_time:12642347 total_time:88496435
REQUEST_STATS_INFO ( 2035) count:5 ave_time:213 total_time:1069
REQUEST_STEP_COMPLETE ( 5016) count:4 ave_time:13680174 total_time:54720699
REQUEST_BUILD_INFO ( 2001) count:4 ave_time:369 total_time:1478
REQUEST_COMPLETE_BATCH_SCRIPT ( 5018) count:3 ave_time:6582324 total_time:19746973
REQUEST_JOB_STEP_CREATE ( 5001) count:3 ave_time:6392575 total_time:19177726
REQUEST_KILL_JOB ( 5032) count:3 ave_time:6208358 total_time:18625074
REQUEST_JOB_ALLOCATION_INFO_LITE ( 4016) count:3 ave_time:21989479 total_time:65968437
REQUEST_SHARE_INFO ( 2022) count:3 ave_time:1449 total_time:4348
REQUEST_JOB_USER_INFO ( 2039) count:1 ave_time:49483669 total_time:49483669
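To see which RPC types account for most of the time, the listing above can be parsed and ranked by total_time. The following is a rough sketch only: it assumes the line layout shown in this message (name, numeric id in parentheses, then count, ave_time and total_time) and assumes the times are in microseconds like the scheduler statistics. The function names are hypothetical, and only a handful of lines from the output are included as sample input.

```python
import re

# A few RPC lines copied verbatim from the sdiag output above.
SDIAG_RPC_LINES = """\
MESSAGE_NODE_REGISTRATION_STATUS ( 1002) count:843 ave_time:4903158 total_time:4133362640
REQUEST_PARTITION_INFO ( 2009) count:98 ave_time:3133 total_time:307125
REQUEST_JOB_INFO ( 2003) count:94 ave_time:14811231 total_time:1392255803
REQUEST_NODE_INFO ( 2007) count:86 ave_time:9364554 total_time:805351683
REQUEST_NODE_INFO_SINGLE ( 2040) count:61 ave_time:13399191 total_time:817350682
"""

# Assumed layout: NAME ( ID) count:N ave_time:N total_time:N
RPC_LINE = re.compile(
    r"^(?P<name>\w+)\s*\(\s*(?P<id>\d+)\)\s+"
    r"count:(?P<count>\d+)\s+ave_time:(?P<ave>\d+)\s+total_time:(?P<total>\d+)\s*$"
)


def parse_rpc_stats(text):
    """Return one dict per RPC line; lines that do not match are skipped."""
    rows = []
    for line in text.splitlines():
        match = RPC_LINE.match(line.strip())
        if match:
            rows.append({
                "name": match.group("name"),
                "id": int(match.group("id")),
                "count": int(match.group("count")),
                "ave_us": int(match.group("ave")),
                "total_us": int(match.group("total")),
            })
    return rows


def rank_by_total_time(rows):
    """Sort RPC types by total processing time, largest first."""
    return sorted(rows, key=lambda row: row["total_us"], reverse=True)


for row in rank_by_total_time(parse_rpc_stats(SDIAG_RPC_LINES)):
    print(f"{row['name']:<34} count={row['count']:>4} "
          f"avg={row['ave_us'] / 1e6:7.3f}s total={row['total_us'] / 1e6:8.1f}s")
```

On the sample lines, MESSAGE_NODE_REGISTRATION_STATUS ranks first by total time, followed by REQUEST_JOB_INFO.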
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 6606 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180611/8bef1f0f/attachment.obj>