[slurm-users] Poor scheduler performance with moderate number of jobs

Kevin M. Hildebrand kevin at umd.edu
Mon Jun 11 08:10:15 MDT 2018


We're seeing some pretty bad performance with around 3000 jobs in the queue.
We're using sched/backfill, and I've been tweaking the bf_ parameters
to try to improve things, with limited results.
But even before the backfill process starts, the main scheduling loop
is taking so long per job that it doesn't even finish scheduling
a single job before it times out.
From watching the logs, when backfill does run it takes 17 seconds
or more to test each job, so it doesn't get very far either.
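
For reference, the kind of SchedulerParameters tweaking I've been trying
looks roughly like the sketch below. The values are illustrative only, not
our production settings (our actual slurm.conf is attached), and I'd check
the slurm.conf man page for which of these options are actually available
in a release as old as 16.05:

    # slurm.conf sketch, illustrative values only
    SchedulerType=sched/backfill
    # bf_interval: seconds between backfill passes
    # bf_max_job_test: number of jobs backfill will consider per pass
    # bf_window / bf_resolution: how far ahead backfill plans, and at what granularity
    # bf_continue: let a backfill pass resume after periodically releasing locks
    # default_queue_depth: cap on jobs the main scheduling loop examines per cycle
    # max_rpc_cnt: defer scheduling when this many slurmctld threads are busy with RPCs
    SchedulerParameters=bf_interval=60,bf_max_job_test=500,bf_window=1440,bf_resolution=300,bf_continue,default_queue_depth=100,max_rpc_cnt=150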

One thing of note: most of the 3000 jobs are serial jobs in our
scavenger queue, which is preemptible.
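
Since most of the backlog sits in that one preemptible partition, I've also
been reading about the per-partition and per-user depth limits that go in
the same SchedulerParameters list. Again, a sketch with illustrative values
only, and I haven't confirmed all of these exist in 16.05:

    # slurm.conf sketch, illustrative values only (these would be folded
    # into the same SchedulerParameters line as above)
    # partition_job_depth: how deep the main scheduler looks into each partition's queue
    # bf_max_job_part / bf_max_job_user: caps on jobs backfill tests per partition / per user
    SchedulerParameters=partition_job_depth=50,bf_max_job_part=100,bf_max_job_user=20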

We're currently running SLURM version 16.05.1, a bit behind, I know.

Any pointers on what to look at would be appreciated.

Thanks,
Kevin
--
Kevin Hildebrand
University of Maryland
Division of IT
-------------- next part --------------
*******************************************************
sdiag output at Mon Jun 11 10:09:17 2018
Data since      Mon Jun 11 09:49:18 2018
*******************************************************
Server thread count: 7
Agent queue size:    0

Jobs submitted: 7
Jobs started:   1
Jobs completed: 3
Jobs canceled:  3
Jobs failed:    0

Main schedule statistics (microseconds):
	Last cycle:   20017144
	Max cycle:    20897866
	Total cycles: 41
	Mean cycle:   18957323
	Mean depth cycle:  1
	Cycles per minute: 2
	Last queue length: 11

Backfilling stats (WARNING: data obtained in the middle of backfilling execution.)
	Total backfilled jobs (since last slurm start): 1
	Total backfilled jobs (since last stats cycle start): 1
	Total cycles: 2
	Last cycle when: Mon Jun 11 10:07:15 2018
	Last cycle: 224511653
	Max cycle:  224511653
	Mean cycle: 213405214
	Last depth cycle: 3
	Last depth cycle (try sched): 3
	Depth Mean: 4
	Depth Mean (try depth): 4
	Last queue length: 11
	Queue length mean: 12

Remote Procedure Call statistics by message type
	MESSAGE_NODE_REGISTRATION_STATUS        ( 1002) count:843    ave_time:4903158 total_time:4133362640
	REQUEST_PARTITION_INFO                  ( 2009) count:98     ave_time:3133   total_time:307125
	REQUEST_JOB_INFO                        ( 2003) count:94     ave_time:14811231 total_time:1392255803
	REQUEST_NODE_INFO                       ( 2007) count:86     ave_time:9364554 total_time:805351683
	REQUEST_NODE_INFO_SINGLE                ( 2040) count:61     ave_time:13399191 total_time:817350682
	REQUEST_UPDATE_NODE                     ( 3002) count:42     ave_time:2736748 total_time:114943454
	REQUEST_JOB_STEP_INFO                   ( 2005) count:38     ave_time:1480198 total_time:56247529
	MESSAGE_EPILOG_COMPLETE                 ( 6012) count:31     ave_time:16408225 total_time:508655003
	REQUEST_JOB_INFO_SINGLE                 ( 2021) count:28     ave_time:10754115 total_time:301115228
	REQUEST_COMPLETE_PROLOG                 ( 6018) count:20     ave_time:18418352 total_time:368367040
	REQUEST_SUBMIT_BATCH_JOB                ( 4003) count:7      ave_time:12642347 total_time:88496435
	REQUEST_STATS_INFO                      ( 2035) count:5      ave_time:213    total_time:1069
	REQUEST_STEP_COMPLETE                   ( 5016) count:4      ave_time:13680174 total_time:54720699
	REQUEST_BUILD_INFO                      ( 2001) count:4      ave_time:369    total_time:1478
	REQUEST_COMPLETE_BATCH_SCRIPT           ( 5018) count:3      ave_time:6582324 total_time:19746973
	REQUEST_JOB_STEP_CREATE                 ( 5001) count:3      ave_time:6392575 total_time:19177726
	REQUEST_KILL_JOB                        ( 5032) count:3      ave_time:6208358 total_time:18625074
	REQUEST_JOB_ALLOCATION_INFO_LITE        ( 4016) count:3      ave_time:21989479 total_time:65968437
	REQUEST_SHARE_INFO                      ( 2022) count:3      ave_time:1449   total_time:4348
	REQUEST_JOB_USER_INFO                   ( 2039) count:1      ave_time:49483669 total_time:49483669

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 6606 bytes
Desc: not available
URL: <http://lists.schedmd.com/pipermail/slurm-users/attachments/20180611/8bef1f0f/attachment.obj>
